Answers to some questions at Futrelle's PhD seminar, 6 December 2002


There were a number of good questions handed in by the students at the PhD seminar. So here are some answers, expanding on the points they were curious about.

In general, the questions asked about things that we are working on or are even just beginning to work on. So there are typically no pat answers for you; in some cases there are no real answers at all, just problems and ideas about how to pursue them.

Machine Learning

Over-training in machine learning can be detected when the performance on new items to be classified, ones not used during training, is notably worse than the performance on the items that were used to train the system. See an excellent machine learning reference such as: T. M. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
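Here is a minimal sketch of that check in Python: a classifier that simply memorizes its (noisy) training data scores perfectly on the items it was trained on, but notably worse on fresh items. The data and the memorizing classifier are invented for illustration.

```python
import random

random.seed(0)

# Toy data: label is 1 when the feature sum exceeds a threshold.
def make_data(n):
    data = []
    for _ in range(n):
        x = [random.random() for _ in range(3)]
        y = 1 if sum(x) > 1.5 else 0
        data.append((x, y))
    return data

train = make_data(200)
test = make_data(200)
# Flip 20% of training labels to simulate noise in the training set.
noisy_train = [(x, (1 - y) if random.random() < 0.2 else y) for x, y in train]

# An extreme over-trainer: predict the label of the nearest training example
# (each training point's nearest neighbor is itself, so it memorizes).
def nearest_label(x, data):
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(data, key=lambda item: sq_dist(item[0], x))[1]

def accuracy(items, train_set):
    hits = sum(1 for x, y in items if nearest_label(x, train_set) == y)
    return hits / len(items)

train_acc = accuracy(noisy_train, noisy_train)  # items seen in training
test_acc = accuracy(test, noisy_train)          # genuinely new items
```

The tell-tale sign of over-training is exactly the gap between train_acc (perfect here) and test_acc.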

The Support Vector Machine implementation we use is SVMlight; see http://svmlight.joachims.org/. For our initial work, we characterized each diagram by an 8-component vector of attributes such as the number of horizontal lines, the number of text items, etc. The vectors are used in inner products or in exponential forms to construct kernels that SVMlight then uses. One of the nice things about SVM methods is that the dimensionality of the learning computation is related to the number of training samples, not to the intrinsic dimensionality of the original problem (which could be infinite-dimensional, e.g., based on some continuous function).
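A small sketch of the kernel computations involved, with made-up attribute vectors; the actual attribute list and kernel parameters in our work may differ:

```python
import math

# Hypothetical 8-component attribute vectors for three diagrams
# (counts of horizontal lines, text items, etc.; illustrative only).
diagrams = [
    [4, 2, 7, 0, 1, 0, 3, 1],   # e.g., a bar chart
    [5, 2, 8, 0, 1, 0, 2, 1],   # a similar bar chart
    [0, 0, 3, 6, 0, 4, 1, 0],   # a quite different diagram
]

def linear_kernel(u, v):
    """Inner product of two attribute vectors."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.01):
    """Exponential (Gaussian/RBF) kernel on the same vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# The kernel (Gram) matrix is all an SVM needs; its size depends on the
# number of training samples (3 here), not on the feature dimensionality.
gram = [[rbf_kernel(u, v) for v in diagrams] for u in diagrams]
```

Note that similar diagrams get a kernel value near 1 and dissimilar ones a value near 0, which is what the SVM exploits.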

Image Processing

There were many concerns about efficiency and performance. My point was that people are still focused on those, so they continue to develop algorithms that do not accomplish the task well. Their algorithms produce poor results quickly, and what's the good in that? My philosophy is to find an algorithm that does a high-quality job first. Once it is found, we can optimize it for speed or memory usage. When we've done all we can there, if we want more performance we just run it on a cluster. I need excellent conversion of raster images to vector images if we are then to hand them to a parser.

Image processing used to be time-consuming. It is getting less so every day.

The methods we are developing for extracting lines, etc., from raster images are novel and powerful. The fundamental philosophy is simple: instead of throwing away information, such as by choosing a threshold and converting each pixel to a binary value, either black or white, we retain full gray levels. Instead of looking at just a central "skeleton" or just the edges of a line, we fit a full line model to a portion of the line. We are essentially doing a model-constrained inversion of the raster image to recover the original "perfect" image, an image that typically arose from a drawing application that was vector-based. An ironic situation.
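A toy version of the model-fitting idea, for a one-dimensional cut across a line: rather than thresholding the gray levels, we fit a line-profile model and recover a sub-pixel center and amplitude. The Gaussian profile shape and the coarse grid search here are illustrative assumptions, not our actual algorithm.

```python
import math

# Model a blurred line's gray-level cross-section as
# amp * exp(-(x - center)^2 / (2 * width^2)).
def profile(x, center, width):
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))

# Synthetic gray levels across a 7-pixel cut through a line.
true_center, true_width, true_amp = 3.2, 1.0, 200.0
grays = [true_amp * profile(x, true_center, true_width) for x in range(7)]

def fit_amplitude(grays, center, width):
    """Least-squares amplitude for a fixed center and width:
    amp = sum(g*m) / sum(m*m), where m is the unit-amplitude model."""
    model = [profile(x, center, width) for x in range(len(grays))]
    return sum(g * m for g, m in zip(grays, model)) / sum(m * m for m in model)

def fit_center(grays, width):
    """Coarse search over candidate centers, keeping the best model fit."""
    best = None
    for c10 in range(0, 70):          # candidate centers 0.0 .. 6.9
        c = c10 / 10.0
        amp = fit_amplitude(grays, c, width)
        err = sum((g - amp * profile(x, c, width)) ** 2
                  for x, g in enumerate(grays))
        if best is None or err < best[0]:
            best = (err, c, amp)
    return best[1], best[2]

center, amp = fit_center(grays, true_width)
```

The recovered center falls between pixel positions, information that a binary threshold-and-skeletonize approach throws away.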

Meta-information in each "pixel": Some of this might include the identities of the two lines that cross such that the pixel lies within their intersection, or the "status" of a pixel, e.g., which analyses have already been applied to it. Another might be information that a pixel is near a set of pixels that represent an object. Such "near" analysis is especially useful in the pyramidal representation of an image/object dataset, where the analysis is done higher in the pyramid at a lower level of resolution. As I said in my talk, some timing experiments showed that we could create a 1M array of Java HashSets, each with two unique objects, a total of 3M objects, in about 1.4 secs on our Sun Blade 2000.
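The flavor of such per-pixel meta-information can be sketched as follows; here Python sets play the role of the Java HashSets, and the fact names ("on-line", "status") and line IDs are invented for illustration:

```python
# Per-pixel meta-information: each pixel coordinate maps to a set of facts.
pixel_meta = {}  # (x, y) -> set of facts about that pixel

def note(pixel, fact):
    """Attach a fact to a pixel, creating its set on first use."""
    pixel_meta.setdefault(pixel, set()).add(fact)

# Two lines cross; pixels in the intersection record both identities.
for p in [(10, 10), (10, 11), (11, 10), (11, 11)]:
    note(p, ("on-line", "line-17"))
    note(p, ("on-line", "line-42"))

# Record analysis status as analyses are applied.
note((10, 10), ("status", "thinned"))

def lines_through(pixel):
    """All line identities recorded at a pixel."""
    return {fact[1] for fact in pixel_meta.get(pixel, set())
            if fact[0] == "on-line"}
```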

Do we analyze images? For the time being, we'll extract the text labels around images. In biology there are many gel diagrams that have labeled columns and "spots" in various vertical positions, some labeled. But that's all we'll do for now.

Diagram Parsing

We have been able to parse small collections of different data graphs and different gene diagrams. The challenge will be to increase the coverage of the grammars, that is, the number of different types of diagrams that can be successfully parsed. We're working hard on that. This will lead to syntactic analyses of diagrams (cf. semantics below).
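To give the flavor of diagram parsing, here is a made-up miniature of a grammar rule that groups graphical primitives (the "tokens" of a diagram) into a higher-level constituent; it is not one of our actual grammar rules:

```python
# Graphical primitives are the tokens; a rule groups them into constituents.
# The rule below ("labeled tick = short vertical line + nearby text") is a
# hypothetical miniature for illustration.
primitives = [
    {"type": "line", "x": 50, "y": 100, "len": 5, "dir": "v"},  # a tick
    {"type": "text", "x": 50, "y": 110, "string": "10"},        # its label
    {"type": "line", "x": 80, "y": 100, "len": 5, "dir": "v"},
    {"type": "text", "x": 80, "y": 110, "string": "20"},
]

def parse_labeled_ticks(prims, max_gap=15):
    """Pair each short vertical line with a text item just below it."""
    ticks = [p for p in prims if p["type"] == "line" and p["len"] <= 5]
    labels = [p for p in prims if p["type"] == "text"]
    pairs = []
    for t in ticks:
        near = [l for l in labels
                if abs(l["x"] - t["x"]) < 2 and 0 < l["y"] - t["y"] <= max_gap]
        if near:
            pairs.append((t, near[0]))
    return pairs

pairs = parse_labeled_ticks(primitives)
```

A full grammar stacks such rules: labeled ticks become a scale, a scale plus an axis line becomes an axis, two axes plus data points become a data graph.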

If you want to know more, look at my diagram demo site.

I described an example of ambiguity, A|B|C|D|E| versus |A|B|C|D|E. Some suggestions as to how to deal with this are described in my ambiguity paper (PDF) below. Briefly, we need to look for the "simplest description". In this case that would be a one-to-one pairing of labels and lines with nothing left over. Such Minimum Description Length (MDL) approaches are the subject of Algorithmic Information Theory; see this link, for example.
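A toy rendering of the "simplest description" idea for that example: score each way of attaching the five labels to the five separator lines by how much is left unexplained. The scoring function here is a crude stand-in for a real MDL measure:

```python
# The ambiguous strip A|B|C|D|E| : does each label attach to the line on
# its right, or on its left? Score interpretations by description length.
labels = ["A", "B", "C", "D", "E"]
lines = ["|"] * 5  # five separator lines, as in A|B|C|D|E|

def description_length(pairing, labels, lines):
    """Toy cost: one unit per pair, plus a penalty per unpaired item."""
    paired_labels = {lab for lab, _ in pairing}
    leftovers = (len(labels) - len(paired_labels)) + (len(lines) - len(pairing))
    return len(pairing) + 3 * leftovers

# Interpretation 1: label i goes with the line to its right (one-to-one).
right_attached = [(labels[i], i) for i in range(5)]
# Interpretation 2: label i goes with the line to its left; "A" then has
# no line, and the last line gets no label.
left_attached = [(labels[i], i - 1) for i in range(1, 5)]

dl_right = description_length(right_attached, labels, lines)
dl_left = description_length(left_attached, labels, lines)
```

The one-to-one pairing with nothing left over wins, which is exactly the "simplest description" criterion.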

Futrelle, R. P. (1999). "Ambiguity in Visual Language Theory and its Role in Diagram Parsing." IEEE Symposium on Visual Languages, VL99. IEEE Computer Soc., Tokyo, pp. 172-175. [Link to paper]

I have paid careful attention to many studies of visual perception to help me understand how people see things. Also, I've spent many hours just looking at diagrams, so I'm very familiar with what's out there. For example, when I work, I almost always have a copy of the journal Science or Nature with me, so I have lots of good examples and stay grounded in reality.

Semantics

We have to be concerned with semantics for both diagrams and text. I intend to keep the semantics as "shallow" as possible; otherwise we've set an impossible task for ourselves. To show that in many common cases deep understanding is not needed for question-answering systems, suppose you are told that "Sally is down the hall zibbling the framenator." Without the foggiest idea of the meanings of the two odd words, you can still correctly answer questions such as "What is Sally zibbling?" A deep semantic analysis is not needed. This is a literal and facile example, I agree, but there's an important principle lurking here that needs exploration and elaboration.

Pursuing this topic further: people think they understand more than they do, so they think that deep analyses are necessary for semantics. They try it and their attempts founder. This is closely related to the "illusion of explanatory depth". See Cognitive Science, Vol. 26 (5) (2002), pp. 521-562, for a recent study of this by Rozenblit and Keil at Yale.

Document representations

People asked about the "format" of our documents. Will it be XML, or what? The "format" will be large tables in relational databases. The entire DB will use only integer UIDs, no strings at all. There will be a few tables that map UIDs to strings, but all the rest will use only UIDs. The main use of strings will be for internal analyses of text items, e.g., comparing "protein" to "proteins", and for presentation to human viewers, who need to see the strings. I view XML as merely an export format, not a serious format worthy of hard-core cutting-edge research on scalable systems. Relational DBs have over 30 years of active development behind them and allow insertion, indexing, and retrieval that are extremely efficient. We are developing methods such that small changes or reinterpretations of portions of huge texts can be done locally, in essentially constant time, something impossible to do in XML.
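A minimal sketch of the all-integer scheme, using SQLite for brevity (the table and column names here are illustrative, not our actual schema):

```python
import sqlite3

# One table maps UIDs to strings; every other table holds only integers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE strings (uid INTEGER PRIMARY KEY, s TEXT UNIQUE)")
db.execute("CREATE TABLE tokens (doc_uid INTEGER, pos INTEGER, word_uid INTEGER)")

def uid_for(s):
    """Intern a string, returning its integer UID."""
    row = db.execute("SELECT uid FROM strings WHERE s = ?", (s,)).fetchone()
    if row:
        return row[0]
    cur = db.execute("INSERT INTO strings (s) VALUES (?)", (s,))
    return cur.lastrowid

# Store a tiny "document" as UIDs only.
for pos, word in enumerate("the protein binds the proteins".split()):
    db.execute("INSERT INTO tokens VALUES (?, ?, ?)", (1, pos, uid_for(word)))

# Internal analyses compare integers; strings appear only at display time.
the_uid = uid_for("the")
count = db.execute("SELECT COUNT(*) FROM tokens WHERE word_uid = ?",
                   (the_uid,)).fetchone()[0]
```

Comparing word occurrences is then pure integer work over indexed columns, with the strings table consulted only when something must be shown to a person.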

As for why PDF and HTML and why not TeX: look at the largest site for scientific documents online, HighWire Press, http://highwire.stanford.edu/. They currently have 12,352,341 articles in over 4,500 Medline journals, and 445,195 free full-text articles from 344 HighWire-hosted journals. And they're all in PDF or HTML format; many of the articles are in both.

Text analysis -- Natural Language Processing/Understanding

Unlike deep parsing methods or purely statistical methods, I'm looking at "containers" (also called "skeletons") with embedded information. These are found using statistical (frequency-based) methods. My example was:

Look for "standard containers" in text. Example: "Tomorrow's high temperature will be 31 degrees." The "31" has high information content (à la Shannon). The rest is a "container". Map queries to containers to produce answers for users. Example: "What will the high be tomorrow?"
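A tiny illustration of this, where the fixed wording is the container and the number is the high-information slot; the patterns are invented for the example:

```python
import re

# The container: fixed, low-information wording with one high-information
# slot (the digits), as in the example above.
container = re.compile(r"Tomorrow's high temperature will be (\d+) degrees")

text = "Cloudy tonight. Tomorrow's high temperature will be 31 degrees."

def answer_high_temp(query, text):
    """Map a query onto the container; the answer is the slot's value."""
    if re.search(r"\bhigh\b.*\btomorrow\b", query, re.IGNORECASE):
        m = container.search(text)
        if m:
            return m.group(1)
    return None

ans = answer_high_temp("What will the high be tomorrow?", text)
```

Note that no deep semantic analysis of "temperature" or "degrees" was needed; the container plus the slot did all the work.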

One student correctly pointed out that HTML might be able to help parsing, not make it harder. That is true. For example, an important initial task is to find sentence boundaries, and HTML tells us that <p> potentially marks a sentence start and </p> a sentence end. But HTML is just messy. For example, Greek letters are typically represented as tiny GIFs. That's not hard to understand, but it's a pain to deal with.
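A sketch of using the tags as boundary hints: text inside <p>...</p> is taken to start and end sentences at the tags, and a naive period split handles the interior. Real journal HTML is far messier than this fragment, which is why the tag-given boundaries are so welcome.

```python
import re

# The <p> tags hand us paragraph (and thus sentence) boundaries for free;
# only the interior needs a heuristic split.
html = "<p>Proteins fold rapidly. Some do not.</p><p>See the figure.</p>"

sentences = []
for para in re.findall(r"<p>(.*?)</p>", html):
    # Naive split after sentence-ending punctuation; the tags guarantee
    # the paragraph's first and last boundaries even when this errs
    # (e.g., on abbreviations like "E. coli" or "Fig. 2").
    parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", para) if s.strip()]
    sentences.extend(parts)
```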

Scalability: Remember that DB indexes are B-trees or hash indexes. The access times for these grow very slowly with DB size. True, the machine may take hours, or even days, to build the indexes. But that's all right, because the text we deal with is archival; once a paper is published, it's not further edited.
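The slow growth is easy to see: a binary search over sorted keys (the kind of lookup that happens inside index pages) needs about log2(n) comparisons, so a thousand-fold larger table costs only about ten more steps.

```python
import bisect
import math

# Comparisons needed for a binary search over n sorted keys.
def comparisons(n):
    return math.ceil(math.log2(n))

small, large = 10_000, 10_000_000
extra = comparisons(large) - comparisons(small)  # cost of 1000x more data

# The lookup itself, on a sorted key array standing in for an index:
keys = list(range(0, 2_000_000, 2))  # a million even "UIDs"
pos = bisect.bisect_left(keys, 123_456)
found = pos < len(keys) and keys[pos] == 123_456
```

B-tree pages hold many keys each, so real index lookups take even fewer page accesses than this comparison count suggests.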

Text diagram relations: A significant fraction of the sentences in biology research articles are either in figure captions or specifically refer to figures, e.g., "So-and-so happens, Fig. 2."

Someone mentioned FireSpout as a new firm that does fancy text analysis. They must have gone belly up, because the last discussions of them I can find were in early 2001. One site had removed its links to them because the FireSpout site was no longer accessible. It's a tough business. I do know something of business, having worked in industry years ago, and I currently do consulting.

I have a substantial site, http://www.bionlp.org, devoted to natural language processing of Biology text. It's running at about 30,000 hits/year now, which is fine for such a specialized site.

Systems for searching

These are rather far down the road, though we do hope to short-circuit some questions, as described in the semantics section earlier. Also, we don't attempt to describe the contents of diagrams in detail, only to make them findable and accessible for viewing.

Research opportunities for you and your friends

It's obvious by now that my lab works on a number of interesting and important, but difficult projects. There are many opportunities in the lab to learn useful things and to publish papers, as well as doing reading courses, and Masters or PhD theses under my direction. I don't have a lot of funding at the moment, but I have a lot of research opportunities and a good size, high-quality group of undergraduates and graduates working with me. If you're interested, get in touch. And tell your friends here and elsewhere that there are opportunities here.

Other

One student asked a quite insightful question about how we might develop tools for authoring documents that would make them easier to analyze and index. We've addressed that issue in an earlier paper. See this MS Word version of the paper:

Futrelle, R. P., and N. Fridman. (1995) "Principles and Tools for Authoring Knowledge-Rich Documents" DEXA 95 (Database and Expert System Applications) Workshop on Digital Libraries, London, UK pp. 357-362. [Link to paper]

Another student asked about the potential similarities between text and diagram parsing. The connections are there, but they're subtle. Ask me again in a year or two.

My chapter on diagram summarization is here (PDF).

All our past systems are being redeveloped in Java. It does the many jobs we need done. (Java 1.4 will become the default on the CCIS Unix systems as of 18 December 2002.) I do my own code development on my Mac Ti Powerbook, but our JDBC code that will interact with Oracle will be developed on Solaris. Unfortunately, the culture of the Lisp community did not lead it to develop the breadth and depth of the libraries and subsystems that I need to really work on a wide range of problems. Even the inventors of Java at first thought it would just be for applets.

Why scientific papers? They are where humankind records everything we have learned about the scientific basis of the world around us and the universe it's embedded in. Without that record, we'd be living as humans lived tens of thousands of years ago. I let other people analyze the Wall Street Journal, people's email, etc.

The Restricted Focus Viewer (RFV) is much easier to use than expensive and cumbersome optical equipment that tracks eye movement. The RFV is a really cool and useful invention, believe me.

How do I know all this stuff? I was interested in technical things as a child, so I've been learning this stuff for nearly 60 years. I'm sure you'll know a lot too when you're my age. I have found that knowledge and experience in every field I've worked in has helped in every other, to some extent or another.

My advice: Study broadly; get the big picture; join the ACM and the IEEE Computer Society so you'll know where you're headed. Read journals, not just books. Many are on line through our library, free. Hang out in bookstores. I like Quantum Books in Kendall Square, but Borders and Barnes & Noble are good as is the Harvard Coop in Kendall Square and Harvard Square. Check out the Consortium Card in the Library that allows you to use libraries at other schools, even MIT. Go to seminars by visitors from other schools and industry. Don't just read, write!