This page briefly describes six recent papers by Bob Futrelle and links to PDF versions of each, which can be downloaded with a browser. Created 12/3/99 and residing at http://www.ccs.neu.edu/home/futrelle/6papers/6papers-wuj.html.

Building systems based on ontologies of biology

"Creating a Knowledge Base of Biological Research Papers" in ISMB94.

ABSTRACT: Intelligent text-oriented tools for representing and searching the biological research literature are being developed, which combine object-oriented databases with artificial intelligence techniques to create a richly structured knowledge base of Materials and Methods sections of biological research papers. A knowledge model of experimental processes, biological and chemical substances, and analytical techniques is described, based on the representation techniques of taxonomic semantic nets and knowledge frames. Two approaches to populating the knowledge base with the contents of biological research papers are described: natural language processing and an interactive knowledge definition tool.

Automated summarization, emphasizing diagrams

"Summarization of Diagrams in Documents" In: Advances in Automated Text Summarization, Ed: I. Mani and M. Maybury. Cambridge, MA, MIT Press (1999). This is one of the few (the only?) papers on the automated summarization of diagrams, as opposed to text.

ABSTRACT: Documents are composed of text and graphics. There is substantial work on automated text summarization but almost none on the automated summarization of graphics. Four examples of diagrams from the scientific literature are used to indicate the problems and possible solutions: a table of images, a flow chart, a set of x,y data plots, and a block diagram. Manual summaries are first constructed. Two sources of information are used to guide summarization. The first is the internal structure of the diagram itself, its topology and geometry. The other is the text in captions, running text, and within diagrams. The diagram structure can be generated using the author's constraint-based diagram understanding system. Once the structure is obtained, procedures such as table element elision or subgraph deletion are used to produce a simpler summary form. Then automated layout algorithms can be used to generate the summary diagram. Current work on parsing and automated layout is briefly reviewed. Because automated diagram summarization is a brand-new area of inquiry, only the parsing phase of the approach has been fully implemented. Because of the complexity of the problem, there will be no single approach to summarization that will apply to all kinds of diagrams.
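The subgraph deletion step mentioned in the abstract can be illustrated with a small sketch. This is not the paper's procedure: the graph representation, the function name `delete_and_bridge`, and the choice to reconnect each kept node to the nearest kept nodes downstream of it are all assumptions made for illustration, using a flow chart stored as a dictionary of successor sets.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # node name -> names of its successor nodes


def delete_and_bridge(graph: Graph, drop: Set[str]) -> Graph:
    """Remove the nodes in `drop` and bridge the gaps they leave.

    Each kept node is connected to the nearest kept nodes that were
    reachable from it through a chain of dropped nodes, so the summary
    chart stays connected where the original was.
    """

    def kept_successors(node: str, seen: Set[str]) -> Set[str]:
        result: Set[str] = set()
        for nxt in graph.get(node, set()):
            if nxt in seen:
                continue  # already handled; also guards against cycles
            seen.add(nxt)
            if nxt in drop:
                # Look through the dropped node to whatever follows it.
                result |= kept_successors(nxt, seen)
            else:
                result.add(nxt)
        return result

    return {n: kept_successors(n, set()) for n in graph if n not in drop}


# Hypothetical five-step flow chart; the two middle steps are elided.
flow = {
    "start": {"prep"},
    "prep": {"wash"},
    "wash": {"measure"},
    "measure": {"end"},
    "end": set(),
}
summary = delete_and_bridge(flow, {"prep", "wash"})
```

Deciding which nodes to drop is the hard part, and the abstract indicates that both diagram structure and the accompanying text would guide that choice. The sketch only covers the mechanical step once that decision has been made.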

Natural language analysis for the biology literature

This paper is probably the most readable one we have produced that covers a number of techniques for natural language analysis and its relation to building knowledge bases. Figures 2 and 3 in the paper are especially interesting.

"Corpus linguistics for establishing the natural language content of digital library documents". In N. R. Adam, B. K. Bhargava & Y. Yesha (Eds.), Digital Libraries. Current Issues (pp. 165-180). Berlin: Springer-Verlag, 1995.

ABSTRACT: Digital Libraries will hold huge amounts of text and other forms of information. For the collections to be maximally useful, they must be highly organized with useful indexes and intra- and inter-document linkages. This brings with it a demand for ever-better methods for automated analysis of text to build the indexes and links. It requires turning implicit information, "encrypted in natural language", into explicit information. We discuss approaches to the automation task built on the techniques of corpus linguistics. This paper focuses on word classification as an example of the utility of corpus methods. Results are presented for the syntactic and semantic classification of words from a biological corpus. The word classes identified can then be used for indexing, query expansion, syntactic analysis and for linking separate library collections by aligning word senses. The paper also discusses derivative objects, diagram analysis and authoring tools. Finally, we outline a new approach to word classification and other language structure analyses based on the minimal complexity principle, in turn based on the theory of Kolmogorov complexity.

Mining Medline to find the time course of discoveries

This paper describes the results of a brief foray into discovering significant research events by mining Medline. "Automated detection of emerging research areas and discoveries in the biological research literature" (submitted to PSB2000, not accepted).

ABSTRACT: Most scientists want to be kept informed of the latest discoveries in their specialty and in their field at large. The emergence of such discoveries and new research areas in the sciences is accompanied by the appearance of new terminology, new word associations, and new document word clusters in the literature. We have been able to detect the emergence of new areas in biological research by studying the time course of the frequency of novel terms and the strength of term associations found from Medline retrieval statistics over a multi-year period. We show that past discoveries, as evidenced by Nobel prizes and highly cited research papers ("hot papers"), can be detected by such methods. To implement the searches, a Web robot was constructed that searches Medline for phrases and pairs of phrases closely related to known discoveries. The work reported here is retrospective and deals with previously identified episodes, but the techniques can obviously be extended to discover emerging research trends without prior identification by monitoring publications as they become available (on-line discovery). Since such studies would not use directed searches, this would require access to the entire corpus, or at least substantial portions of it, something not attempted in this work. It should be possible to extend our approach to elucidate more details of bifurcations and hybridization of research areas, shifts in terminology, and the decline of older lines of inquiry. A more powerful approach to these problems could be based on document clustering using the full text of abstracts or even entire papers, but this paper shows that the far simpler and less computationally intensive approach based on term frequencies and term associations can yield significant results.
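To make the idea of following the time course of a term's frequency concrete, here is a minimal sketch. It is not the method used in the paper: the function names, the trailing three-year baseline, the fivefold threshold, and the example counts are all hypothetical, and the yearly counts are assumed to have been collected already (for example, as hit counts from a literature search).

```python
from typing import Dict, Optional


def relative_frequency(
    term_counts: Dict[int, int], total_counts: Dict[int, int]
) -> Dict[int, float]:
    """Normalize yearly hit counts for a term by the yearly corpus size."""
    return {
        year: term_counts.get(year, 0) / total_counts[year]
        for year in sorted(total_counts)
        if total_counts[year] > 0
    }


def emergence_year(
    freq: Dict[int, float],
    baseline_years: int = 3,
    factor: float = 5.0,
    floor: float = 1e-6,
) -> Optional[int]:
    """Return the first year whose frequency is at least `factor` times
    the mean of the preceding `baseline_years` years, or None.

    `floor` keeps a baseline of zero from flagging every first mention.
    """
    years = sorted(freq)
    for i in range(baseline_years, len(years)):
        window = [freq[y] for y in years[i - baseline_years:i]]
        baseline = max(sum(window) / baseline_years, floor)
        if freq[years[i]] >= factor * baseline:
            return years[i]
    return None


# Made-up counts for one term and a constant corpus size per year.
term = {1985: 1, 1986: 1, 1987: 1, 1988: 2, 1989: 40, 1990: 90}
totals = {year: 100_000 for year in term}
year = emergence_year(relative_frequency(term, totals))  # 1989 for this data
```

The abstract also relies on the strength of associations between pairs of terms. The same kind of yearly comparison could be applied to co-occurrence counts, although the specific association measure is not given here.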

Two diagram papers

The first paper here is the 'standard reference' to our diagram parsing work. "Efficient Analysis of Complex Diagrams using Constraint-Based Parsing" In: Intl. Conf. Document Analysis and Recognition, ICDAR95.

ABSTRACT: This paper describes substantial advances in the analysis (parsing) of diagrams using constraint grammars. The addition of set types to the grammar and spatial indexing of the data make it possible to efficiently parse real diagrams of substantial complexity. The system is probably the first to demonstrate efficient diagram parsing using grammars that can easily be retargeted to other domains. The work assumes that the diagrams are available as a flat collection of graphics primitives: lines, polygons, circles, Bezier curves and text. This is appropriate for future electronic documents or for vectorized diagrams converted from scanned images. The classes of diagrams that we have analyzed include x,y data graphs and genetic diagrams drawn from the biological literature, as well as finite state automata diagrams (states and arcs). As an example, parsing a four-part data graph composed of 133 primitives required 35 sec using Macintosh Common Lisp on a Macintosh Quadra 700.
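Spatial indexing helps because a constraint such as "near" or "touching" then only has to be checked against primitives in the same neighborhood rather than against every primitive in the diagram. The sketch below shows one common way to do this, a uniform grid over bounding boxes. The abstract does not say which index structure the system uses, so the grid, the class name `GridIndex`, and the cell size are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def _overlaps(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes intersect or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class GridIndex:
    """Uniform grid mapping cells to the primitives whose boxes cover them."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.cells: Dict[Tuple[int, int], Set[int]] = defaultdict(set)
        self.boxes: List[Box] = []

    def _cells_for(self, box: Box) -> Iterator[Tuple[int, int]]:
        xmin, ymin, xmax, ymax = box
        size = self.cell_size
        for cx in range(int(xmin // size), int(xmax // size) + 1):
            for cy in range(int(ymin // size), int(ymax // size) + 1):
                yield (cx, cy)

    def insert(self, box: Box) -> int:
        """Store a primitive's bounding box and return its id."""
        item_id = len(self.boxes)
        self.boxes.append(box)
        for cell in self._cells_for(box):
            self.cells[cell].add(item_id)
        return item_id

    def query(self, box: Box) -> List[int]:
        """Return ids of stored boxes that overlap `box`."""
        candidates: Set[int] = set()
        for cell in self._cells_for(box):
            candidates |= self.cells.get(cell, set())
        return sorted(i for i in candidates if _overlaps(self.boxes[i], box))


index = GridIndex(cell_size=10.0)
tick = index.insert((0.0, 0.0, 5.0, 5.0))
legend = index.insert((50.0, 50.0, 60.0, 60.0))
curve = index.insert((4.0, 4.0, 12.0, 12.0))
near_origin = index.query((3.0, 3.0, 6.0, 6.0))  # ids of tick and curve
```

A grid works well when primitives are of broadly similar size. A tree-based index would be the usual alternative for diagrams that mix very large and very small elements.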

The last paper is a recent one on the problem of ambiguity in the parsing and analysis of diagrams. It is one of the few (the only?) papers on this topic. "Ambiguity in Visual Language Theory and its Role in Diagram Parsing" In: IEEE Symposium on Visual Languages, VL99, Tokyo, 1999.

ABSTRACT: To take advantage of the ever-increasing volume of diagrams in electronic form, it is crucial that we have methods for parsing diagrams. Once a structured, content-based description is built for a diagram, it can be indexed for search, retrieval, and use. Whenever broad-coverage grammars are built to parse a wide range of objects, whether natural language or diagrams, the grammars will overgenerate, giving multiple parses. This is the ambiguity problem. This paper discusses the types of ambiguities that can arise in diagram parsing, as well as techniques to avoid or resolve them. One class of ambiguity is attachment, e.g., the determination of what graphic object is labeled by a text item. Two classes of ambiguities are unique to diagrams: segmentation and occlusion. Examples of segmentation ambiguities include the use of a portion of a single line as an entity itself. Occlusion ambiguities can be difficult to analyze if occlusion is deliberately used to create a novel object from its components. The paper uses our context-based constraint grammars to describe the origin and resolution of ambiguities. It assumes that diagrams are available as vector graphics, not bitmaps.
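The attachment ambiguity can be illustrated with a toy distance heuristic. The paper itself works with constraint grammars, not with this heuristic; the function name, the use of object centre points, and the tolerance value are hypothetical. The point is only that a label can have more than one plausible target, in which case a parser has to carry several candidate parses forward.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def attachment_candidates(
    label: Point, objects: Dict[str, Point], tolerance: float = 1.2
) -> List[str]:
    """Return the objects a text label might plausibly be attached to.

    An object qualifies if its distance to the label is within
    `tolerance` times the distance of the nearest object. More than one
    name in the result means the attachment is ambiguous.
    """
    distances = {
        name: math.dist(label, centre) for name, centre in objects.items()
    }
    if not distances:
        return []
    nearest = min(distances.values())
    close = [n for n, d in distances.items() if d <= tolerance * nearest]
    return sorted(close, key=distances.get)


# Hypothetical data graph: a label sits almost midway between two curves.
objects = {"curve_a": (3.0, 4.0), "curve_b": (0.0, 5.5), "axis": (30.0, 40.0)}
candidates = attachment_candidates((0.0, 0.0), objects)
```

Here both curves are returned, so distance alone cannot settle which one the label names. Additional context, such as the kinds of constraints a grammar can express, would be needed to choose between them.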