Freshman Honors, Sp04 - E-Graphics - Prof. Futrelle

Professor Futrelle - College of Computer and Information Sciences, Northeastern U., Boston, MA

Version of 11 January 2003

E-Docs (online electronic documents) are full of graphics = E-Graphics

Google and other search systems are pretty good at indexing text. But if you want a diagram about certain bones (like one your friend just broke!) or a diagram showing a certain football play, good luck. Google can sometimes find the pictures you want, but more often it's a very roundabout search. There's no such thing as a diagram query system that allows you to point at a diagram and say you want another like it but about something a bit different. And there is certainly no system that allows you to draw something and ask it to return something "like it". Google image search can return lots of images, but you rarely find out who created an image, where a photo was shot from, etc.

The crucial importance of figures in research papers

In Biology, the largest scientific literature, the results presented in the figures and the discussion of them make up about half of every paper. In a recent research proposal I submitted to the National Science Foundation, I discussed some of these issues: half of all the research results in Biology are inaccessible to information retrieval systems. The proposal describes a concerted attack on the difficult problems involved in "unlocking" the content of these millions of figures so that the required information retrieval systems can be built. Two problems need to be solved: first, the conversion, or vectorization, of the raster-based figure images to an object-based form, and second, the analysis, or parsing, of the objects to generate a structured description of the figure content, which might be a data graph, a gene diagram or a gel photo. It is the former problem, vectorization, that my first talk in this Honors seminar is about. The goal of the research I proposed is to build a system that can turn the millions of figures in the literature into information objects that can be stored, indexed and retrieved in a new generation of information retrieval systems. Building those retrieval systems is an important project for the future, but the proposed project focuses on the vectorization and parsing tasks that form their basis.
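To give a feel for what "raster-based" versus "object-based" means, here is a minimal Python sketch. It is only a toy illustration and not the method used in the proposed research: it assumes the figure has already been reduced to a grid of 0/1 pixels, and it turns each horizontal run of foreground pixels into a single segment object. The function name `horizontal_runs` and the `(row, first_col, last_col)` segment format are invented for this example.

```python
from typing import List, Tuple

# Hypothetical segment format for this sketch: (row, first_col, last_col), inclusive.
Segment = Tuple[int, int, int]


def horizontal_runs(raster: List[List[int]]) -> List[Segment]:
    """Return every maximal horizontal run of foreground (1) pixels as a segment."""
    segments: List[Segment] = []
    for y, row in enumerate(raster):
        start = None  # column where the current run began, if we are inside one
        for x, pixel in enumerate(row):
            if pixel and start is None:
                start = x  # a new run begins here
            elif not pixel and start is not None:
                segments.append((y, start, x - 1))  # the run ended at the previous column
                start = None
        if start is not None:
            # the run reached the right edge of the image
            segments.append((y, start, len(row) - 1))
    return segments


# A tiny 3x5 "image": 1 is ink, 0 is background.
raster = [
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
print(horizontal_runs(raster))
```

Fifteen pixels become three segment objects. A real vectorizer has to go much further (lines at any angle, curves, text, noise), but the idea is the same: replace a grid of pixels with a short list of objects that a parser can then reason about.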

There are two things to say about the future of E-Graphics. The first is that research such as I'm proposing is essential if we're to "unlock" graphics so that their content is readily accessible. The second is that the research has to be gradually adopted or adapted by authors, publishers and information systems providers if E-Graphics are to become a rich and usable resource. It goes without saying that a lot more research than just mine will be needed to make all this happen. And don't forget: an undergraduate in my lab, Dan Crispell, was the force behind the implementation of our first major attack on the vectorization problem.

The figure below is a typical one from a recent Biology research paper, the kind I'm proposing to vectorize and parse.
