Information Extraction and Natural Languag Processing
Department of Information Science
University of Tokyo
Bunkyo-ku, Tokyo 113-0033 JAPAN
The papers below are aimed to illustrative of the work that has taken place in information extraction in the last five years, as of October 2000. Much of this work has been influenced by the DARPA-sponsored Message Understanding Conferences, which has subdivided the IE task into up to four distinct phases: named entity, coreference resolution, template element and scenario template. This has influenced a generation of IE systems, and this is reflected in the headings of papers given below as well.
The list is not complete and there are undoubtedly several papers that have been unintentionally missed out. In particular, the list of the named entity task is oriented towards systems that learn automatically from corpora. You find a list of papers on more structure-oriented methods in the list prepared by Dr.S.Ananiadou for the tutorial.
(Note: ** Key paper - shows some technique that has influenced the development of the field. * Recommended reading.)
I wish to thank the members of the GENIA project at the University of Tokyo. In. particular, the bibliography provided by Dr. N.Collier helped greatly to compile this annotated list.
1. General Introduction
(1-1)*Appelt, D., and Israel, D.(1999): Introduction to Information Extraction Technology, Tutorial for IJCAI-99, (http://www.ai.sri.com/~appelt/ie-tutorial/)
This is a good current state of the arts survey of the field, easy to read. A list of useful web sites is also given, such as publicly available linguistic resources, etc. A simple tool kit of IE is also available from the web site.
(1-2) Pazienza, M.(Ed.) (1999): Information Extraction, Lecture Notes in Artificial Intelligence 1714, Springer
A collection of papers which show the breadth of the field, including future directions (adaptability of IE systems, the relationships with other fields like digital library, IR, Speech, etc.) and description of a concrete system by the University of Rome.
(1-3) Grishman, R.(1998): Information Extraction and Speech Recognition, in Proceedings of the Broadcast News Transcription and Understanding Workshop (available from http://www.cs.nyu.edu/cs/projects/proteus/)
A short paper. While the second half is on IE and Speech, the first half of the paper is a brief explanation of IE. If you need a short description of IE, this gives you a good introduction to the field.
(1-4)*MUC-6, Proceedings of the Sixth Message Understanding Conference (MUC-6), Morgan Kaufmann, San Meteo, CA.
(1-5)*MUC-7 (1998), Proceedings of the Seventh Message Understanding Conference (MUC-7)}, published on the web site http://www.muc.saic.com/
The DARPA-sponsored Message Understanding Conferences (MUC) have greatly affected the development of the field, including concrete task definitions, evaluation methods, selection of domains, etc. From these proceedings, you can see how IE systems work and how their performances are evaluated. The task definitions, annotation scheme and evaluation methods of MUC-6 are found in the MUC-6 web site (http://cs.nyu.edu/cs/faculty/grishman/muc6).
2. Actual IE systems: MUC Systems
(2-1)*Appelt, D. et al. (1993), “FASTUS: a finite-state processor for information extraction from real-world texts”, Proceedings of the International Joint Conference on Artificial Intelligence.
The system developed by SRI, FASTUS, is one of the representative systems of IE developed in MUC. If you are interested in more thorough descriptions of FASTUS, see (Hobbs,J.R. et al.: FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text, which appear in (8-5). The article is also available at http://www.ai.sri.com/~appelt/fastus-schabes).
(2-2)*Grishman, R. (1995), “The NYU system for MUC-6 or where's the syntax?”, Proceedings of the 6th Message Understanding Conference}, pp 167-176, DARPA.
Another representative of the current IE systems based on pattern-matching techniques (the expressive power almost equivalent to finite-state machines). While in FASTUS given patterns are expanded to surface patterns by macros, their system has a special input system that accepts surface examples and expands them, through question-answering with a system designer, to generalized surface patterns.
The system works deterministically and uses hand-made patterns. However, the NYU research group has tried various machine learning techniques to the named-entity recognition task. Those are based on decision trees (5-4), Maximum Entropy (5-5), etc.
Their tool kit, which enables users to define a simple ontology of a domain and check with actual corpora whether generated patterns properly work, is described in (Sekine, S. et al.(1998): An Information Extraction System and a Customization Tool, in Proceedings of the New Challenges in Natural Language Processing and its Application, Tokyo, available from http://www.cs.nyu.edu/cs/projects/proteus/ ).
(2-3) Krupka, G. R. and Hausman, K. (1998), “IsoQuest Inc.: Description of the NetOwl (tm) Extractor System as Used in MUC-7!”, Proceedings of 7th Message Understanding Conference (MUC-7)}, DARPA.
(2-4) Srihari, R. (1998), “A Domain Independent Event Extraction Toolkit”, AFRL-IF-RS-TR-1998-152 Final Technical Report, published by the Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.
(2-5)*Srihari, R. and Li, W. (1999), “Information Extraction Supported Question Answering”, Proceedings of TREC-8.
TREC (Text Retrieval Conference: http://trec.nist.gov/) is a twin brother of MUC under DARPA-sponsored TIPSTER text program. While TREC used to focus on conventional Information Retrieval, it has started a new track called QA (question answering) track. The paper (2-5) describes an attempt of using an IE system (Textract developed by Cymfony Inc.) in the MUC framework to the QA task.
There are several interesting comments in this paper. For example, they claim that scenario templates (ST) in MUC were too domain specific and that the group had to redesign them for GE (General Event) templates in order to cope with open-ended nature of the QA task. A similar observation was made by Appelt in (1-1), though his remark was concerned with the discrepancy between linguistic representations and the output templates. FASTUS by SRI has an internal representation directly extracted from texts, from which a post processor generates the official MUC-6 templates.
3. IE Systems for Biology, Biomedicine, Biochemistry, etc.
(3-1) * Rindflesch, T. C. et al. (2000), “EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature”, Proceedings of the Pacific Symposium on Bio-informatics (PSB'2000), Hawaii, USA, January.
Unlike most of the other systems that were developed by NLP research groups, this system was developed by a research group specialized in Bioinformatics. Domain specific resources like UMLS, etc. are effectively used, together with special programs like MetaMap that maps noun phrases to UMLS semantic types. The terms that the authors used in this paper are somewhat different from standard terms of the NLP community. Their “under-specified parser”, for example, is like a shallow parser or a chunker in the NLP community. Some of the techniques illustrated look domain specific and highly dependent on available resources and their properties. However, the paper is full of insightful observations that have to be reflected in IE systems for this domain (See also (7-4)).
(3-2) Craven, M. and Kumlian, J. (1999): Constructing Biological Knowledge Base by Extracting Information from Text Sources, in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-99).
Assuming that semantic lexicon (i.e. pairs of terms and their semantic classes) is given, they try to train their probabilistic models, Naïve Bayes and a relational learning algorithm. While the naïve Bayes model identifies sentences in abstracts that contain relevant information, the relational learning algorithm identifies which phrases in the identified sentences participate in relevant relationships. The system is not an IE system, but a knowledge acquisition system for IE (See also Section 7).
(3-3) Proux, D., et al. (2000): A Pragmatic Information Extraction Strategy for gathering Data on Genetic Interaction, in proceedings of 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, Calif., pp 279 - 285
A conventional IE system is applied to extract gene interactions. It uses POS-tagging based on FST and HMM, shallow parsing of local structures around verbs, and knowledge-based processing by using Conceptual Graph of Sowa. They reported that complex linguistic structures including nominalized events, co-ordinations etc. in Medline abstracts hamper the performance.
(3-4) Milward, T., et al. (2000): Automatic Extraction of Protein Interactions from Scientific Abstracts, in Proceedings of Pacific Symposium on Biocomputing, pp538-549, World Scientific Press.
(3-5)Hamphrays, K., et al. (2000): Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures, in Proceedings of Pacific Symposium on Biocomputing, pp 72-80, World Scientific Press
These two papers reported preliminary results of applying IE tool kits to biology texts. (3-4) is by SRI, Cambridge (Highlight) , while (3-5) is by Sheffield University (Gate, LaSIE). Due to the nature of preliminary experiments, the details of their evaluation methods are not given.
4. IE Systems with full parsers
Recently, there have been a few attempts of using full-fledged sentential parsing for IE. They claim that the later stages of IE i.e. merging of templates, co-reference recognition, etc. become convoluted without explicit recognition of sentence structures (See (6-3)). It is also noted in (3-1) and (3-2) that sentences in abstracts of scientific papers tend to have complex sentential structures like nested co-ordinations, which cause difficulties on simple shallow parsing or pattern-matching techniques.
While (4-1) is a general purpose IE system, (4-2) and (4-3) are applied to Biochemistry fields.
(4-1) Ciravegna,F. et al. (1999) “Full Text Parsing using cascades of Rules”, in Proceedings. of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL99)
This is a paper by a European project FACILE. Their work is also found in (Black, W., et al. (1998): “FACILE: Description of the NE System Used for MUC-7", in Proceedings of Message Understanding Conference Proceedings (MUC-7)). A tool kit of IE (Pinocchio) has also been developed (http://ecate.itc.it:1025/cirave/pinocchio).
(4-2) Park, J. C., et al. (2001): Bi-directional Incremental Parsing for Automatic Pathway Identification with Combinatory Categorical Grammar, in Proceeding of PSB, Hawaii (to appear)
Many interesting examples are given to show the reasons why more linguistically sound frameworks are required for IE in Bio-chemistry application. They focus on extracting protein-protein interactions from texts. Efficiency of full parsing based on CUG is improved by restricting the analysis to structures around a set of designated verbs. They use CUG (Categorical Unification Grammar) as the grammar formalism.
(4-3)*Yakushiji, A., et al. (2001): Event Extraction from Biomedical Papers using a Full Parser, in Proceedings of PSB 2001, Hawaii http://www-tsujii.is.s.u-tokyo.ac.jp/)(to appear).
They use a special algorithm for efficient parsing for HPSG-like grammar formalisms and devise a special method of extracting relevant information from partial parse results. Even if full sentential parses are not obtained, relevant information is to be extracted from partial parse results.
5. Named entity recognition using machine learning techniques
While techniques based on hand-made patterns had been dominant till MUC-5, there were a strong interest in applying machine learning techniques to the task in MUC-6 and MUC-7. Since the NE task in IE shares common techniques with term recognition, a comprehensive bibliography is given by the bibliography of Dr. S.Ananiadou. The following list focuses on NE recognition in IE and particular, those using machine learning techniques (See also Section 7).
(5-1)**Bikel, D. M., et al. (1997), “Nymble: a High-Performance Learning Name-finder”, Proceedings of the Fifth Conference on Applied Natural Language Processing, Morgan Kaufmann Publishers, pp. 194-201.
Among NE systems based on learning methods, the system in this paper shows the best performance of the time. This is the first system that uses HMM for the NE task. The paper shows that HMM based on a set of simple orthographic features gives a remarkably good result, e.g. around 90 % accuracy.
(5-2) *Collier, N., Nobata, C., and Tsujii, J. (2000), “Extracting the Names of Genes and Gene Products with a Hidden Markov Model”, Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrucken, Germany.
The named entity recognition techniques based on HMM were first applied to terms in the bio-chemical domain. The result shows that simple orthographical features work fairy well in this domain as well, while the overall performance is not as good as the NE task in MUC. The same group applied a decision tree method to the same problem with the same set of features, the result of which is less than HMM (See: Nobata, C., et al.(1999): “Automatic Term Identification and Classification in Biology Texts”, in Proceeding. of 5th Natural Language Processing Pacific Rim Symposium, Beijing). The results show that the NE task in the biology domain is much harder than the MUC domain, due to abundant uses of multi-word expressions, abbreviations, complex co-ordination within term expressions, etc.
(5-3)**Borthwick, A. et al. (1998), “Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition”, Proceedings of the Sixth Workshop on Very Large Corpora, pp 152-160.
(5-4) Sekine, S., Grishman, R. and Shinou, H. (1998), “A decision tree method for finding and classifying names in Japanese texts”, Proceedings of the Sixth Workshop on Very Large Corpora.
These two papers are by the NYU group. While simple HMM does not allow multi-facet features, decision trees and ME (maximum entropy) can deal with them. In particular, ME accept a large set of features, from which it learns which features are relevant to the task. ME seem to outperform other learning methods.
(5-5)*Collins, M. and Singer, Y. (1999), “Unsupervised Models for Named Entity Classification”, Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, University of Maryland, USA (http://www.research.att.com/~singer/).
Supervised learning methods tend to require substantial human effort of preparing annotated corpora. This nullifies the advantage of trainable systems over the knowledge engineering approach (See the discussion in (1-1) by Appelt). This paper shows that a new technique called “Co-Training” can reduce the burden (Blum, A. and Mitchell, T. (1998): Combining Labeled and Unlabeled Data with Co-Training, in Proceedings of COLT-98, Madison, Wisconsin, http://www.cs.cmu.edu/~avrim/). Starting with a small set of rules, the system learns semantic classifiers of PERSON, ORGANIZATION and LOCATION, from unlabeled data.
(5-6) Fukuda, et al., (1999): “Toward Information extraction: Identifying protein names from biological papers”, in Proc. of the Pacific Symposium on Biocomputing 98 (PSB 98), Hawaii.
This is not a paper of NE recognition based on machine learning. However, the paper shows for the first time that terms in bio-chemical domains, in particular, protein names can be recognized by a set of simple heuristics. The performance of the system was good, while the evaluation method had not been well established.
6. Coreference resolution
After creating templates of individual entities and events with their properties, an IE system merges them to form integrated templates from which all kinds of information of the same entities and events are to be obtained. Coreference identification plays a crucial role in this stage.
(6-1) Aone, C. and Bennet, S. W. (1995), “ Evaluating automated and manual acquisition of anaphora resoluation strategies”, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), pp 122-129, Cambridge, MA, June.
(6-2) Hirschman, L. et al. (1997), “Automating Coreference: The Role of Annotated Training Data”, Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing.
(6-3)*Kehler, A. (1997), “Probabilistic Coreference in Information Extraction”, Proceedings of the 1997 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
The paper discusses probabilistic models based on Maximum Entropy (ME) for coreference resolution. The paper assumes that the first stage (entity and event recognition) is performed by FASTUS and returns possible coreference relationships with their ratings, which will be used by a downstream application system. An interesting comment that lack of reasonable linguistic representations in FASTUS makes the coreference resolution task unnecessary difficult is found in this paper.
(6-4) Kennedy, C. and Boguraev, B. (1996), “Anaphora for everyone: Pronominal anaphora resolution without a parser”, Proceedings of the 16th International Conference on Computational Linguistics (COLING-96).
7. Knowledge Acquisition and Ontology
(7-1)*Muslea, I. (1999): “Extraction Patterns for Information Extraction Tasks: A Survey”, in Proceedings of The AAAI-99 Workshop on Machine Learning for Information Extraction.
In order to build a pattern-based IE system, you have to prepare a set of patterns manually. Most of IE tool-kits developed provide some devices which lessen this manual effort, like macros in FASTUS, generalization of examples in the NYU system, etc. However, these devices still require substantial human efforts. Design of patterns also assume existence of proper domain ontology. Though there exist a few domain-independent ontologies like cyc, word-net, euro-wordnet, EDR, etc., it is often the case that these domain independent ontologies are not so effective for IE. This is definitely the case for IE systems for specific scientific texts.
Therefore, automatic acquisition of patterns and ontologies from texts has attracted significant interests recently. While this paper is not comprehensive (eg: it does not cover substantial amounts of corpus-based research in computational linguistics such as acquisition of sub-categorization frames, word clustering, etc.), this is a good introduction to the filed and shows how the two fields, KA (Knowledge Acquisition) and IE, are now merging.
(7-2) Soderland, S., et al. (1995): CRYSTAL: Inducing a Conceptual Dictionary, in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 95).
CRYSTAL is a part of the IE system developed by the University of Massachusetts, which is used by their language analyzer BADGER. While BADGER is domain independent, the dictionary of CN (Concept Nodes), which CRYSTAL is to build from an annotated corpus, contains domain specific information. The information in a CN is essentially the same as patterns that are used by other IE systems based on pattern-matching techniques.
CRYSTAL accepts annotated texts (annotated in terms of semantic classes of phrases) and takes them as examples of patterns. It tries to generalize those examples to generalized patterns by using hierarchy of concepts. In this paper, they use UMLS in as the semantic class hierarchy [See also (Aseltine, J. (1999): WAVE: An Incremental Algorithm for Information Extraction, In Proceedings of the AAAI Workshop on Machine Learning for Information Extraction).]
(7-3) Riloff, E. and Jones, R. (1999): Learning Dictionary for Information Extraction by Multi-Level Bootstrapping, in Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99).
CRYSTAL in (7-2) assumes that semantic classes are given before learning of patterns. In this paper, they propose “mutual bootstrapping” that learns semantic classes and patterns simultaneously.
(7-4) Rindflesh, et al. (1999): Mining Molecular Binding Terminology from Biomedical Text, in Proceedings of AMIA-99.
This is a part of the project of (3-1). The program ARBITER recognizes terms that are relevant to molecular binding relationships.
(7-4) Maynard, D., and Ananiadou, S. (2000): “Identifying Terms by their Family and Friends”, in Proceedings of Coling 2000, Saarbrucken, Germany.
The first step to acquire domain ontologies is to collect terms of a given domain. In particular, there are many multi-word terms in Medicine and biology, which is one of the major causes of difficulties in term recognition in these fields. The paper addresses how to collect these multi-word terms based on collocation distribution of words.
(7-5) Guarino, N., et al. (1995): Ontologies and Knowledge bases towards a Terminological Clarification, in Proceedings of Towards Very large Knowledge Bases, pp 25- 32
Interests in ontology has emerged through discussions in various research fields such as knowledge representation and sharing in Artificial Intelligence, multi-lingual machine translation, data management, CALS, community software, etc. It has its own root in philosophy. This paper discusses what ontology means, what otology is and what it is not. Guarino’s web site (http://www.ladseb.pd.cnr.it/infor/people/Guarino.html) provides a comprehensive list of useful sites in this field.
(7-6) Useful sites of Domain Independent Ontology:
(7-6-1) Cyc (http://www.cyc.com/): Encyclopedic knowledge base.
(7-6-2) Wordnet (http://www.cogsci.princeton.edu/)
(7-6-3) Euro-wordnet (http://www.hum.uva.nl/~ewn/)
(7-6-4)EDR( http://www.iijnet.or.jp/edr/): Lexical resources: mono-lingual dictionaries of Japanese and English and a Concept dictionary. The concept dictionary can be used as general otology.
(7-6-5)Mikrokosomos: (http://crl.nmsu.edu/Research/Projects/mikro/): Ontology for knowledge-based MT
(7-7)*McEntire, R., Karp, P., et al. (2000): An Evaluation of Ontology Exchange Languages for Bioinformatics, in Proceedings of 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, pp 239-250
The languages for representation and exchange of ontology are compared and evaluated in terms of Biology application. Seven candidates, ASN.1, ODL, Onto, OML/CKML, OPM, XML/RDF and UML, are evaluated in details for biochemistry application.
(7-8)*Tateishi, Y. et al. (2000): Building an Annotated Corpus in the Molecular-Biology Domain, in Proceedings of Workshop on Semantic Annotation and Intelligent Content, Coling 2000, Saarbrucken, Germany, pp 28-34.
This is a paper on text annotation for biology texts, not ontology. However, the semantic annotation assumes certain background ontology of the field, which is now being constructed by a group of the University of Tokyo (http://www-tsujii.is.s.u-tokyo.ac.jp/).
8. Basic Techniques
(8-1) ** Rabiner, L. and Juang, B. (1986), “An introduction to hidden Markov models”, IEEE ASSP Magazine, pp 4-16, January.
A standard paper of introduction to HMM.
(8-2) Quinlan, J.R. (1993): “C4.5: Programs for Machine Learning”, Morgan Kaufman Publishers, San Mateo, Calif.
A standard textbook of decision trees. The program C4.5 is avilable from http://www.cse.unsw.edu.au/~quinlan/. The advanced version C5.0 is commercially available.
(8-3) **Viterbi, A. J. (1967), “Error bounds for convolution codes and an asymptotically optiumum decoding algorithm”, IEEE Trans. Information Theory, IT-13(2), pp 260-269.
When probabilities of state transitions and emission probabilities are given, a naive way of finding the optimal transition path - the most probable path) involves a large search space and time-consuming. The Viterbi algorithm is an efficient algorithm for solving this problem. The time complexity of this algorithm is O(TN2), where T and N are the length of the sequence and the number of states. A modified version of this algorithm which combines the original version with certain rules that constrain legitimate sequences widely used.
(8-4)**Pereira, F., et al.. (1991): “Finite State Approximation of Phrase Structure Grammars”, Proceedings of 29th Meeting of the Association for Computational Linguistics, Berkeley, California, pp246-255.
The discussion in this paper justified the use of FS instead of more expressive frameworks in NLP, and has influenced on directions of NLP research. One of the precursor of this direction was a discussion given by Church, K.W. (On Memory limitations in Natural Language Processing, MIT Laboratory of Computer Science Technical Report MIT/LCS/TR-245,1980).
(8-5)*Roche, E. and Schabes, Y. (eds.) (1997): Finite State Langugae Processing, The MIT Press.
A good text book on finite state techniques. The techniques and the ideas of CG (Constraint Grammar) in this book are materialized as a commercial product ENGCG of Lingsoft, the performance of which in tagging of English texts is impressive. Their web site is http://www.lingsoft.fi/cgi-bin/engcg. FST approximation of phrase structure grammar by Pereira (8-4) also appears in this textbook.