Accomplishments and challenges in literature data mining for biology

Lynette Hirschman, Jong C. Park, Junichi Tsujii, Limsoon Wong and Cathy H. Wu

Bioinformatics 18 (12) 1553-1561 (2002)


We review recent results in literature data mining for biology and discuss the need for, and the steps toward, a challenge evaluation for this field. Literature data mining has progressed from simple recognition of terms to extraction of interaction relationships from complex sentences, and has broadened from recognition of protein interactions to a range of problems such as improving homology search and identifying cellular location. To encourage participation and accelerate progress in this expanding field, we propose creating challenge evaluations, and we describe two specific applications in this context.

