Distributed Systems Seminar

TITLE:  Building a State of the Art Text Understanding System:
        		How XML Can Help
SPEAKER:  Sergey Bratus
TIME:  Friday, 2:00 p.m., 149 Cullinane Hall, Northeastern University

  ( Sergey Bratus received his Ph.D. from Northeastern University
    in 1999.  He is currently working at BBN. )

XML is regarded as the foundation of future distributed protocols,
and XML-related activity is seen as laying the groundwork for new
distributed application architectures. I am going to describe how we
used XML as the base for the architecture of a state-of-the-art text
understanding system.

The task of our system was to identify entities, such as persons,
organizations and locations, in newswire-like text, and to report
relations between them, such as an organization being at a specific
location. The system used statistical language models to find names of
entities in the input text and to produce a full syntactic parse for
each sentence. A number of further statistical modules then
identified descriptive mentions of the same entities in the text,
such as ``the company'', and established co-reference between those
and the name mentions. Each module enriched the output of its
predecessor in the pipeline.
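
As a rough illustration of what "each module enriched the output of
its predecessor" can look like when the shared data is XML, here is a
minimal Python sketch. It is not the system described in the talk:
the element names (document, sentence, name, relation), the toy
lexicon KNOWN_NAMES, and the functions find_names, find_relations and
run_pipeline are all hypothetical, and a dictionary lookup stands in
for the statistical models.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon; a stand-in for a statistical name finder.
KNOWN_NAMES = {"BBN": "organization", "Cambridge": "location"}


def find_names(doc):
    """Stage 1: add a <name> child for every known name in a sentence."""
    for sent in doc.findall("sentence"):
        text = sent.get("text")
        for name, kind in KNOWN_NAMES.items():
            start = text.find(name)
            if start >= 0:
                ET.SubElement(sent, "name", type=kind,
                              start=str(start), end=str(start + len(name)))
    return doc


def find_relations(doc):
    """Stage 2: read the <name> elements left by stage 1 and add a
    <relation> for each organization/location pair in one sentence."""
    for sent in doc.findall("sentence"):
        names = sent.findall("name")
        orgs = [n for n in names if n.get("type") == "organization"]
        locs = [n for n in names if n.get("type") == "location"]
        for org in orgs:
            for loc in locs:
                ET.SubElement(sent, "relation", type="located-at",
                              org=org.get("start"), loc=loc.get("start"))
    return doc


def run_pipeline(text):
    """Wrap the text in XML and pass the same tree through every stage."""
    doc = ET.Element("document")
    ET.SubElement(doc, "sentence", text=text)
    for stage in (find_names, find_relations):
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline("BBN is based in Cambridge.")
    print(ET.tostring(result, encoding="unicode"))
```

Because every stage reads and writes the same tree, the output of any
intermediate stage can be serialized, inspected, or compared against a
stored copy without stage-specific code.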

This produced large amounts of structured information. Maintaining that
information, catching exceptional conditions due to malformed input,
visualizing the output together with the logged debugging information,
and regression testing the system after adjustments were no easy tasks.
We used an XML framework to represent all of our data, and were able to
address these issues successfully. I will talk about the tools we created
for this purpose, and about the insights into possible collaborative
distributed applications of this technology that we gained from this
work.