Talk title:
The Challenges of Building Practical Knowledge-Based Systems
by Professor Bob Futrelle, NU CCIS - 24 February 2003
Our (knowledge) heritage: people, artifacts and, especially, documents.
I focus on the scientific literature, especially the biology research literature.
But the methods I develop are broadly applicable.
The biology/biomedical literature is the largest scientific literature.
The National Library of Medicine's web-searchable PubMed database,
http://www.ncbi.nlm.nih.gov/PubMed/, is searched 30M times per month.
My history in this: Theoretical Physics (MIT), 1966. Hooked up with some
physicists working on biology, 1972. Moved to biology and was on the
biology faculty at U. Illinois (Urbana/Champaign) until 1985. Moved to Northeastern
in 1986. Set up the Biological Knowledge Laboratory (BKL) in 1989 supported
by a large NSF grant (National Science Foundation). From the very beginning,
starting in 1958 with my MIT UG thesis, I have been constantly involved
with Computer Science.
The challenges I've set for myself in my research. Why they're important
and why they're hard.
- Not just retrieving documents but extracting knowledge from docs.
- What kind of a thing is this "knowledge"?
- Example: Having a system tell you that, based on the literature,
A controls B. (Based on words alone, this could as easily have been
B controls A.)
- Example: Having a system find a particular diagram with a certain content.
(Can't always tell from the caption alone, since the diagram is there to
show you things that cannot or need not be explained in text.)
- If the system just finds documents, then you have to read them.
Instead, if it finds answers, it can save you huge amounts
of work (and reading).
- These problems tend to be "AI-complete". That is, you have to know
everything to understand any particular thing.
And I've said this before.
- How I've gone about this work is described below.
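To make the "A controls B" point above concrete, here is a minimal sketch, not taken from our systems, showing that a representation based on words alone cannot tell the two statements apart. The class name BagOfWords and the method bag are hypothetical names chosen for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // Count how often each lowercased word occurs, ignoring word order.
    // (Hypothetical helper for illustration only.)
    static Map<String, Integer> bag(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : sentence.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> ab = bag("A controls B");
        Map<String, Integer> ba = bag("B controls A");
        // Both sentences reduce to the same word counts, so the
        // direction of control is lost.
        System.out.println(ab.equals(ba));
    }
}
```

The two bags compare equal, so any system that only counts words needs something more (word order, syntax, or domain knowledge) to recover who controls whom.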
We started early, 1989, before the WWW was properly established.
(First official release of the original Mosaic browser was Nov. 1993.)
In spite of that, we built a hypertext system with graphics and text links
and published a description of our system in 1993 -- but it was not
web or internet based.
Our more serious natural language work involved clustering to discover
semantic relations as well as "hard-core" parsing, using parsing technology
from Cambridge U. (Various papers listed in my online CV.)
Explain the clustering concept and the following example.
We also worked, early on, on diagram parsing, because diagrams are
very important in essentially every biology paper.
This work was done in Lisp on the Macintosh, because Lisp could handle
the symbolic computations and the Mac could do the required graphics.
Here's some of this as seen on my Diagram Demo site at
http://www.ccs.neu.edu/home/futrelle/diagrams/demo-10-98/
Jumping ahead to current work
I work with a number of graduate students and undergraduates (some of you!)
on a number of projects.
- We have licensed some 50,000 full-length online papers from the
American Society for Microbiology, a major publisher. These comprise
about 300M words and half a million diagrams.
- We have our own copy of Oracle for research, on a Sun box
with 8GB of memory.
- Some of the PDF versions of the papers have vector graphics
and we have parsed the PDF and extracted and split up the figures
(work w. Mingyan Shao).
- We have used machine learning (support vector machines, SVM)
to classify these diagrams (work with Chris Cieslik, a Senior). (paper submitted)
- The overwhelming percentage of diagrams in online papers are
in raster form (gifs and jpegs). To analyze these we need
to vectorize them. We are developing the M3 System
(Moment/Model Method) with a new set of techniques to do very
high quality vectorization. Here's a typical figure to
illustrate the problems and approaches.
Here's an image and its vectorized form after a run of our M3 prototype. (Work primarily
by Dan Crispell, ECE UG minoring in CS. Mike Preshman, CS UG, has also
worked on this project.)
- On the natural language side, we are eschewing parsing in the conventional
sense and looking for common patterns of expression in which a "framework"
contains the information-rich "data" items. A common example is
"The current temperature is 90 degrees." (not in February, unfortunately).
The data item is "90" and the rest is the much more common framework.
Here is a page that shows how we can use data visualization to see these
patterns. They can be found by looking for low-frequency items contained
in patterns made up of higher frequency items. We are building visualization
tools using dynamically generated displays and lazy evaluation to visualize
hundreds of thousands of these patterns in a single scrolling window.
All work done in Java/Swing. (run code here)
In /Users/bob/Research/NLP-BKL/AndreaGrimes/viewerV0.1/ do
java TextViewApp 1
All work on the viewer by Andrea Grimes, UG.
- I am working on reimplementing the entire diagram parsing system
in Java (JDUS) with the viewing system JDUSI in Swing.
The underlying data structures will be similar to PDF.
- We are interested in the relations of text and graphics and have
even done experiments tracking how people deal with the two together.
See this figure, for example.
(Work with Anna Rumshisky, PhD student, now in PhD program at Brandeis.
Continuing development here by Chen Zhang, UG, and currently by
Stephanie Fillion, UG).
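The framework/data idea from the natural language item above (low-frequency items sitting inside patterns of higher-frequency items) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code behind our viewer: the class name FrameworkPatterns, the three-token window, and the two thresholds minFrame and maxData are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class FrameworkPatterns {
    // Lowercase the sentence and split it on anything that is not a letter or digit.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assumption: a "framework" is a pair of neighbors that each occur at least
    // minFrame times in the corpus; a "data" item is a token between them that
    // occurs at most maxData times.
    static Map<String, TreeSet<String>> findPatterns(
            List<String> sentences, int minFrame, int maxData) {
        // Pass 1: corpus-wide token frequencies.
        Map<String, Integer> freq = new HashMap<>();
        List<List<String>> tokenized = new ArrayList<>();
        for (String s : sentences) {
            List<String> toks = tokenize(s);
            tokenized.add(toks);
            for (String t : toks) {
                freq.merge(t, 1, Integer::sum);
            }
        }
        // Pass 2: slide a three-token window and keep rare middles
        // that are flanked by frequent neighbors.
        Map<String, TreeSet<String>> patterns = new TreeMap<>();
        for (List<String> toks : tokenized) {
            for (int i = 1; i + 1 < toks.size(); i++) {
                String left = toks.get(i - 1);
                String mid = toks.get(i);
                String right = toks.get(i + 1);
                boolean frame = freq.get(left) >= minFrame && freq.get(right) >= minFrame;
                if (frame && freq.get(mid) <= maxData) {
                    patterns.computeIfAbsent(left + " _ " + right, k -> new TreeSet<>()).add(mid);
                }
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
            "The current temperature is 90 degrees.",
            "The current temperature is 72 degrees.",
            "The current temperature is 65 degrees.");
        // Prints each framework with the data items found in its slot.
        System.out.println(findPatterns(corpus, 2, 1));
    }
}
```

On this toy corpus the only framework found is "is _ degrees", with 65, 72 and 90 as its data items. A real corpus would need longer windows and frequency thresholds tuned to its size.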