Return to BIONLP.ORG home page

Standard corpora

By "standard corpora" we mean primarily non-technical corpora, such as those drawn from newspapers and similar sources. A corpus is more than just electronic text online; there are billions of words of such text. A corpus is typically a collection chosen to represent some particular genre, or one made up of carefully chosen samples from a wide variety of genres. The most famous corpus in NLP is the Brown Corpus: 1 million words made up of about 500 samples from a variety of sources, each almost exactly 2,000 words in length. Each word was manually assigned a part-of-speech tag, such as noun, adverb, or comparative adverb. Though the corpus is considered small by today's standards, it was the standard corpus for NLP studies for many years, and its tagset was copied by many later workers for larger corpora.

Because the words are tagged, the corpus is far easier to parse. In fact, some form of part-of-speech tagging is normally done before the parsing of any text is attempted, for two reasons. First, many words are ambiguous with respect to their part of speech: "free" can be an adjective or a verb, and "run" can be a noun, a verb, or occasionally an adjective. Second, syntactic parsing relies heavily on the part-of-speech category of each word in a sentence rather than on any more complex characterization of the word.
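The simplest way a tagged corpus resolves this ambiguity is a baseline "most frequent tag" tagger: for each word, pick the tag it received most often in the tagged training data. The sketch below is purely illustrative; the lexicon counts are hypothetical, whereas a real tagger would derive them from a corpus such as Brown.

```python
# Toy lexicon mapping words to tag frequency counts.
# The counts here are invented for illustration; in practice they
# would be gathered from a hand-tagged corpus like the Brown Corpus.
LEXICON = {
    "free": {"JJ": 300, "VB": 40},   # usually an adjective, sometimes a verb
    "run":  {"VB": 450, "NN": 120},  # usually a verb, sometimes a noun
    "the":  {"DT": 10000},           # unambiguous determiner
}

def most_frequent_tag(word):
    """Baseline unigram tagger: choose the tag seen most often for the word."""
    tags = LEXICON.get(word.lower())
    if tags is None:
        return "NN"  # a common fallback: guess noun for unknown words
    return max(tags, key=tags.get)

def tag_sentence(sentence):
    """Tag each whitespace-separated token of a sentence."""
    return [(w, most_frequent_tag(w)) for w in sentence.split()]
```

Even this trivial strategy tags a surprisingly large fraction of running text correctly, which is why it is the usual baseline against which real taggers are measured.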

The major supplier of corpora for NLP research is the Linguistic Data Consortium (LDC), based at the University of Pennsylvania in Philadelphia. It offers corpora in many languages, including corpora transcribed from speech and audio corpora used in the development of speech recognition systems. Memberships are available that allow members to obtain data free or at substantially reduced prices; educational (non-profit) and corporate (for-profit) membership rates differ substantially.

At LDC, popular datasets include lexicons such as CELEX2. A most useful (and expensive) corpus is TREEBANK, which LDC describes as follows: "This CD-ROM contains over 1.6 million words of hand-parsed material from the Dow Jones News Service, plus an additional 1 million words tagged for part-of-speech. This material is a subset of the language model corpus for the DARPA CSR large-vocabulary speech recognition project. It also contains the first fully parsed version of the Brown Corpus, which has also been completely retagged using the Penn Treebank tag set. Also included are tagged and parsed data from Department of Energy abstracts, IBM computer manuals, MUC-3 and ATIS."
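Tagged corpora in the Brown tradition are distributed as plain text in which each token carries its tag after a slash (word/tag). A minimal sketch of reading one such line follows; the sample sentence is an invented line in the Brown word/tag style, not a quotation from the corpus.

```python
def parse_tagged(line):
    """Split a Brown-style tagged line into (word, tag) pairs.

    rpartition is used so that tokens containing a slash in the
    word itself still split on the final slash before the tag.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

# An illustrative line in the Brown word/tag format:
sample = "The/at jury/nn said/vbd"
```

Parsed (treebank) data, by contrast, is distributed as bracketed constituent structures rather than flat word/tag lines, which is why the parsed and tagged portions of TREEBANK are described separately above.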

The corpus linguistics page by Mike Barlow, mentioned on our On-line Resources page, is another excellent gateway into the world of corpora.