Biology corpora

Note added 6 July 2003, updated 7 May 2005:

The BioMed Central research article corpus is available for data mining. BioMed Central has published 8874 peer-reviewed research articles (as of 7 May 2005), all of which are covered by BioMed Central's open access license policy: http://www.biomedcentral.com/info/about/charter. Unlike a traditional journal's license agreement, BioMed Central's license allows completely free reuse and redistribution of the content by anyone. Note that these are full-text articles, not abstracts. Further details are available here.

The standard biology corpus (a corpus being a large collection of text in electronic form) is MEDLINE. It is now possible to obtain the entire corpus in electronic form, subject to the appropriate license agreements. (I have not obtained copies of MEDLINE myself and would be happy to hear from anyone out there who has and is using it now.)

Many specialized NLP studies have been done by downloading modest numbers of selected MEDLINE abstracts and analyzing them. But serious NLP work to build lexicons, gather statistical data, or do machine learning requires corpora of many millions of words, e.g., a hundred thousand abstracts.
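To make the scale concrete, here is a minimal Python sketch of the kind of statistics gathering described above: counting words across a set of abstracts and tallying term frequencies as a first step toward a lexicon. The tokenizer, the function names, and the two sample sentences are all illustrative assumptions, not taken from any MEDLINE tool, and the figure of 200 words per abstract is a rough guess used only to show the arithmetic.

```python
import re
from collections import Counter

# Assumed, deliberately crude tokenizer: a letter followed by letters,
# digits, or hyphens (so gene names such as "p53" survive as one token).
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9\-]*")


def tokenize(text):
    """Split text into lower-cased tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


def corpus_stats(abstracts):
    """Return (total word count, term frequency table) for the abstracts."""
    freq = Counter()
    total = 0
    for abstract in abstracts:
        tokens = tokenize(abstract)
        freq.update(tokens)
        total += len(tokens)
    return total, freq


def estimate_corpus_words(n_abstracts, words_per_abstract):
    """Back-of-the-envelope corpus size; words_per_abstract is a guess."""
    return n_abstracts * words_per_abstract


# Made-up sample text standing in for downloaded abstracts.
sample = [
    "The p53 protein binds DNA.",
    "Mutations in p53 are common in human tumors.",
]
total_words, freq = corpus_stats(sample)

# Under the assumed 200 words per abstract, a hundred thousand
# abstracts come to twenty million words.
estimated = estimate_corpus_words(100_000, 200)
```

On a real corpus the same loop would be fed from files or a database rather than an in-memory list, and the tokenizer would need far more care with chemical names, abbreviations, and punctuation.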

Most journal publishers strongly restrict what can be done with the online journals they publish. Some journals, e.g., PNAS, allow free access to issues older than one month. But whether free or paid, this access does not normally extend to wholesale downloading of the full text of thousands of papers and their storage and manipulation on your own computers. I intend to continue to explore this topic myself and, as with MEDLINE, would be interested to hear from people who might be negotiating with publishers for large-scale access to electronic journal articles.

More and more of the journals published online by HighWire Press are becoming freely available, with varying licensing agreements.