PhD talk, Professor Futrelle, 6 December 2002 - Page 4
How we are solving the problems - work in progress
We are currently supported by two NSF (National Science Foundation) grants,
one for text (natural language understanding) and the other for diagrams.
- Download the 300 million words and store them in Oracle in a novel
sequence format using "extreme tokenization". Example: "1.2" is "1", ".", "2".
Everything is based on UIDs; no strings are involved in most of the work
(tokenization sketched below). Take full advantage of DB indexing and
Java hash tables in 8 GB of RAM.
- Look for "standard containers" in text. Example:
"Tomorrow's high temperature will be 31 degrees."
The "31" has high information content (ala Shannon).
The rest is a "container". Map queries to containers
to produce answers for users. Example: "What will the high be tomorrow?"
- Use the rare PDF files with vector diagrams to investigate
vector issues. Example: Render the vector graphics into a spatial index
pyramid (SPAS) and use object statistics to train a Support Vector Machine
classifier (95% accuracy by the leave-one-out measure; evaluation
sketched below).
- Build a vectorization system that turns raster diagrams into
vector diagrams. Current systems ignore Moore's Law. Example:
We represent a megapixel image by an array of one million Java
objects, not integers (pixel objects sketched below). A high-quality,
model-based approach, the Strategic Vectorization Project;
uses Swing for visualization.
- Build a parser generator to solve the constraint-based
diagram parsing problem. Design semantic interpreters that walk
the parse tree and build meaning compositionally (interpreter
sketched below).
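
A minimal Java sketch of the "extreme tokenization" and UID idea from the first
bullet, assuming a simple hash-table interning scheme. Class and method names are
illustrative; the Oracle storage of the UID sequences is omitted.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Every maximal run of letters or digits, and every single punctuation
    // character, becomes its own token; each distinct token string is interned
    // once to an integer UID, and downstream work touches only the int sequence.
    public class ExtremeTokenizer {
        private final Map<String, Integer> uidOf = new HashMap<>();
        private final List<String> tokenOf = new ArrayList<>();  // UID -> string, for display only

        private int intern(String token) {
            Integer uid = uidOf.get(token);
            if (uid == null) {
                uid = tokenOf.size();
                uidOf.put(token, uid);
                tokenOf.add(token);
            }
            return uid;
        }

        // "1.2" becomes the three tokens "1", ".", "2".
        public int[] tokenize(String text) {
            List<Integer> uids = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                char c = text.charAt(i);
                if (Character.isWhitespace(c)) { i++; continue; }
                int j = i + 1;
                if (Character.isLetterOrDigit(c)) {
                    while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) j++;
                }
                uids.add(intern(text.substring(i, j)));
                i = j;
            }
            int[] seq = new int[uids.size()];
            for (int k = 0; k < seq.length; k++) seq[k] = uids.get(k);
            return seq;
        }

        public static void main(String[] args) {
            ExtremeTokenizer t = new ExtremeTokenizer();
            for (int uid : t.tokenize("The high will be 1.2 degrees."))
                System.out.println(uid + " -> " + t.tokenOf.get(uid));
        }
    }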
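
A hypothetical sketch of the "standard container" idea from the second bullet: the
fixed wording of the sentence is the container, the variable high-information part
is a slot, and a query that matches the container is answered with the slot filler.
The Container class and the regular expressions are illustrative stand-ins.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ContainerDemo {

        static class Container {
            final Pattern sentencePattern;  // matches the container text, captures the slot
            final Pattern queryPattern;     // recognizes questions this container can answer

            Container(String sentenceRegex, String queryRegex) {
                sentencePattern = Pattern.compile(sentenceRegex);
                queryPattern = Pattern.compile(queryRegex, Pattern.CASE_INSENSITIVE);
            }

            // Return the slot value if this container answers the query from the text.
            String answer(String query, String text) {
                if (!queryPattern.matcher(query).find()) return null;
                Matcher m = sentencePattern.matcher(text);
                return m.find() ? m.group(1) : null;
            }
        }

        public static void main(String[] args) {
            Container highTemp = new Container(
                "Tomorrow's high temperature will be (\\d+) degrees",
                "what will the high be tomorrow");
            String text = "Tomorrow's high temperature will be 31 degrees.";
            System.out.println(highTemp.answer("What will the high be tomorrow?", text));  // 31
        }
    }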
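
A sketch of the leave-one-out measure quoted for the diagram classifier, assuming a
hypothetical Classifier interface. The real system trains a Support Vector Machine
on object statistics from the spatial index pyramid; a trivial nearest-neighbour
stand-in is used here only so the sketch runs end to end.

    public class LeaveOneOut {

        interface Classifier {
            void train(double[][] features, int[] labels);
            int predict(double[] features);
        }

        // For each example i: train on all the other examples, then test on example i.
        static double accuracy(Classifier c, double[][] x, int[] y) {
            int correct = 0;
            for (int i = 0; i < x.length; i++) {
                double[][] trainX = new double[x.length - 1][];
                int[] trainY = new int[x.length - 1];
                for (int j = 0, k = 0; j < x.length; j++) {
                    if (j == i) continue;
                    trainX[k] = x[j];
                    trainY[k] = y[j];
                    k++;
                }
                c.train(trainX, trainY);
                if (c.predict(x[i]) == y[i]) correct++;
            }
            return (double) correct / x.length;
        }

        // Nearest-neighbour stand-in for the SVM, for demonstration only.
        static class NearestNeighbour implements Classifier {
            private double[][] x;
            private int[] y;
            public void train(double[][] features, int[] labels) { x = features; y = labels; }
            public int predict(double[] f) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int i = 0; i < x.length; i++) {
                    double d = 0;
                    for (int k = 0; k < f.length; k++) d += (f[k] - x[i][k]) * (f[k] - x[i][k]);
                    if (d < bestDist) { bestDist = d; best = y[i]; }
                }
                return best;
            }
        }

        public static void main(String[] args) {
            double[][] x = { {0, 0}, {0, 1}, {5, 5}, {5, 6} };
            int[] y = { 0, 0, 1, 1 };
            System.out.println("leave-one-out accuracy = " + accuracy(new NearestNeighbour(), x, y));
        }
    }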
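
A sketch of the "objects, not integers" raster representation from the vectorization
bullet, assuming a simple Pixel class: each pixel of a megapixel image becomes a Java
object that can carry later analysis state. Field names are illustrative, not the
Strategic Vectorization Project's actual classes.

    public class PixelArrayDemo {

        static class Pixel {
            final int x, y;
            final int gray;         // original intensity
            int regionLabel = -1;   // filled in by later analysis passes
            boolean onCurve;        // set when the pixel is attributed to a vector primitive

            Pixel(int x, int y, int gray) { this.x = x; this.y = y; this.gray = gray; }
        }

        // Wrap a width*height raster of intensities in one Pixel object per pixel.
        static Pixel[][] wrap(int[] raster, int width, int height) {
            Pixel[][] image = new Pixel[height][width];
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                    image[y][x] = new Pixel(x, y, raster[y * width + x]);
            return image;
        }

        public static void main(String[] args) {
            int width = 1000, height = 1000;          // one megapixel
            int[] raster = new int[width * height];   // dummy all-black image
            Pixel[][] image = wrap(raster, width, height);
            System.out.println("pixel objects created: " + (long) image.length * image[0].length);
        }
    }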
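
A hypothetical sketch of a semantic interpreter from the last bullet: it walks a
diagram parse tree bottom-up and builds meaning compositionally, so a node's meaning
is computed only from the meanings of its children. The node types and the string
"meanings" are placeholders for the real constraint-based grammar and its semantics.

    import java.util.Arrays;
    import java.util.List;

    public class DiagramInterpreter {

        static class Node {
            final String type;           // e.g. "BAR_CHART", "AXIS", "BAR"
            final String value;          // lexical content of a leaf, if any
            final List<Node> children;

            Node(String type, String value, Node... children) {
                this.type = type;
                this.value = value;
                this.children = Arrays.asList(children);
            }
        }

        // Compositional rule: interpret the children first, then combine the results.
        static String interpret(Node n) {
            if (n.children.isEmpty()) return n.value;
            StringBuilder parts = new StringBuilder();
            for (Node child : n.children) {
                if (parts.length() > 0) parts.append(", ");
                parts.append(interpret(child));
            }
            return n.type + "(" + parts + ")";
        }

        public static void main(String[] args) {
            Node chart = new Node("BAR_CHART", null,
                new Node("AXIS", "y: temperature"),
                new Node("BAR", "31"));
            System.out.println(interpret(chart));   // BAR_CHART(y: temperature, 31)
        }
    }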