PhD talk, Professor Futrelle, 6 December 2002 - Page 4
How we are solving the problems - work in progress
We are currently supported by two NSF (National Science Foundation) grants,
one for text (natural language understanding) and the other for diagrams.
- Download the 300 million words and store them in Oracle in a novel
sequence format using "extreme tokenization". Example: "1.2" is "1", ".", "2".
Everything is based on UIDs; no strings are involved in most of the work
(tokenization sketched below). Take full advantage of DB indexing and
Java hash tables in 8 GB of RAM.
- Look for "standard containers" in text. Example:
"Tomorrow's high temperature will be 31 degrees."
The "31" has high information content (ala Shannon).
The rest is a "container". Map queries to containers
to produce answers for users. Example: "What will the high be tomorrow?"
- Use the rare PDF files with vector diagrams to investigate
vector issues. Example: Render the vector graphics into a spatial index
pyramid (SPAS) and use object statistics to train a Support Vector Machine
classifier (95% accuracy by the leave-one-out measure; evaluation
sketched below).
- Build a vectorization system that turns raster diagrams into
vector diagrams. Current systems ignore Moore's Law. Example:
We represent a megapixel image by an array of one million Java
objects, not integers (pixel objects sketched below). A high-quality,
model-based approach, the Strategic Vectorization Project;
uses Swing for visualization.
- Build a parser generator to solve the constraint-based
diagram parsing problem. Design semantic interpreters that walk
the parse tree and build meaning compositionally (interpreter
sketched below).
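
A minimal Java sketch of the "extreme tokenization" and UID idea from the first
bullet, assuming a simple hash-table interning scheme. Class and method names are
illustrative; the Oracle storage of the UID sequences is omitted.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Every maximal run of letters or digits, and every single punctuation
    // character, becomes its own token; each distinct token string is interned
    // once to an integer UID, and downstream work touches only the int sequence.
    public class ExtremeTokenizer {
        private final Map<String, Integer> uidOf = new HashMap<>();
        private final List<String> tokenOf = new ArrayList<>();  // UID -> string, for display only

        private int intern(String token) {
            Integer uid = uidOf.get(token);
            if (uid == null) {
                uid = tokenOf.size();
                uidOf.put(token, uid);
                tokenOf.add(token);
            }
            return uid;
        }

        // "1.2" becomes the three tokens "1", ".", "2".
        public int[] tokenize(String text) {
            List<Integer> uids = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                char c = text.charAt(i);
                if (Character.isWhitespace(c)) { i++; continue; }
                int j = i + 1;
                if (Character.isLetterOrDigit(c)) {
                    while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) j++;
                }
                uids.add(intern(text.substring(i, j)));
                i = j;
            }
            int[] seq = new int[uids.size()];
            for (int k = 0; k < seq.length; k++) seq[k] = uids.get(k);
            return seq;
        }

        public static void main(String[] args) {
            ExtremeTokenizer t = new ExtremeTokenizer();
            for (int uid : t.tokenize("The high will be 1.2 degrees."))
                System.out.println(uid + " -> " + t.tokenOf.get(uid));
        }
    }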
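
A hypothetical sketch of the "standard container" idea from the second bullet: the
fixed wording of the sentence is the container, the variable high-information part
is a slot, and a query that matches the container is answered with the slot filler.
The Container class and the regular expressions are illustrative stand-ins.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ContainerDemo {

        static class Container {
            final Pattern sentencePattern;  // matches the container text, captures the slot
            final Pattern queryPattern;     // recognizes questions this container can answer

            Container(String sentenceRegex, String queryRegex) {
                sentencePattern = Pattern.compile(sentenceRegex);
                queryPattern = Pattern.compile(queryRegex, Pattern.CASE_INSENSITIVE);
            }

            // Return the slot value if this container answers the query from the text.
            String answer(String query, String text) {
                if (!queryPattern.matcher(query).find()) return null;
                Matcher m = sentencePattern.matcher(text);
                return m.find() ? m.group(1) : null;
            }
        }

        public static void main(String[] args) {
            Container highTemp = new Container(
                "Tomorrow's high temperature will be (\\d+) degrees",
                "what will the high be tomorrow");
            String text = "Tomorrow's high temperature will be 31 degrees.";
            System.out.println(highTemp.answer("What will the high be tomorrow?", text));  // 31
        }
    }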
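
A sketch of the leave-one-out measure quoted for the diagram classifier, assuming a
hypothetical Classifier interface. The real system trains a Support Vector Machine
on object statistics from the spatial index pyramid; a trivial nearest-neighbour
stand-in is used here only so the sketch runs end to end.

    public class LeaveOneOut {

        interface Classifier {
            void train(double[][] features, int[] labels);
            int predict(double[] features);
        }

        // For each example i: train on all the other examples, then test on example i.
        static double accuracy(Classifier c, double[][] x, int[] y) {
            int correct = 0;
            for (int i = 0; i < x.length; i++) {
                double[][] trainX = new double[x.length - 1][];
                int[] trainY = new int[x.length - 1];
                for (int j = 0, k = 0; j < x.length; j++) {
                    if (j == i) continue;
                    trainX[k] = x[j];
                    trainY[k] = y[j];
                    k++;
                }
                c.train(trainX, trainY);
                if (c.predict(x[i]) == y[i]) correct++;
            }
            return (double) correct / x.length;
        }

        // Nearest-neighbour stand-in for the SVM, for demonstration only.
        static class NearestNeighbour implements Classifier {
            private double[][] x;
            private int[] y;
            public void train(double[][] features, int[] labels) { x = features; y = labels; }
            public int predict(double[] f) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int i = 0; i < x.length; i++) {
                    double d = 0;
                    for (int k = 0; k < f.length; k++) d += (f[k] - x[i][k]) * (f[k] - x[i][k]);
                    if (d < bestDist) { bestDist = d; best = y[i]; }
                }
                return best;
            }
        }

        public static void main(String[] args) {
            double[][] x = { {0, 0}, {0, 1}, {5, 5}, {5, 6} };
            int[] y = { 0, 0, 1, 1 };
            System.out.println("leave-one-out accuracy = " + accuracy(new NearestNeighbour(), x, y));
        }
    }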
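
A sketch of the "objects, not integers" raster representation from the vectorization
bullet, assuming a simple Pixel class: each pixel of a megapixel image becomes a Java
object that can carry later analysis state. Field names are illustrative, not the
Strategic Vectorization Project's actual classes.

    public class PixelArrayDemo {

        static class Pixel {
            final int x, y;
            final int gray;         // original intensity
            int regionLabel = -1;   // filled in by later analysis passes
            boolean onCurve;        // set when the pixel is attributed to a vector primitive

            Pixel(int x, int y, int gray) { this.x = x; this.y = y; this.gray = gray; }
        }

        // Wrap a width*height raster of intensities in one Pixel object per pixel.
        static Pixel[][] wrap(int[] raster, int width, int height) {
            Pixel[][] image = new Pixel[height][width];
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                    image[y][x] = new Pixel(x, y, raster[y * width + x]);
            return image;
        }

        public static void main(String[] args) {
            int width = 1000, height = 1000;          // one megapixel
            int[] raster = new int[width * height];   // dummy all-black image
            Pixel[][] image = wrap(raster, width, height);
            System.out.println("pixel objects created: " + (long) image.length * image[0].length);
        }
    }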
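
A hypothetical sketch of a semantic interpreter from the last bullet: it walks a
diagram parse tree bottom-up and builds meaning compositionally, so a node's meaning
is computed only from the meanings of its children. The node types and the string
"meanings" are placeholders for the real constraint-based grammar and its semantics.

    import java.util.Arrays;
    import java.util.List;

    public class DiagramInterpreter {

        static class Node {
            final String type;           // e.g. "BAR_CHART", "AXIS", "BAR"
            final String value;          // lexical content of a leaf, if any
            final List<Node> children;

            Node(String type, String value, Node... children) {
                this.type = type;
                this.value = value;
                this.children = Arrays.asList(children);
            }
        }

        // Compositional rule: interpret the children first, then combine the results.
        static String interpret(Node n) {
            if (n.children.isEmpty()) return n.value;
            StringBuilder parts = new StringBuilder();
            for (Node child : n.children) {
                if (parts.length() > 0) parts.append(", ");
                parts.append(interpret(child));
            }
            return n.type + "(" + parts + ")";
        }

        public static void main(String[] args) {
            Node chart = new Node("BAR_CHART", null,
                new Node("AXIS", "y: temperature"),
                new Node("BAR", "31"));
            System.out.println(interpret(chart));   // BAR_CHART(y: temperature, 31)
        }
    }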