This page describes two excellent sources of preprints and papers: arXiv.org, immediately below, and the NEC ResearchIndex, further down the page. Below are a few examples of interesting and useful papers; there are many more where they came from.
The primary repository for preprints in computational linguistics is http://www.arXiv.org, or more specifically, the Computing Research Repository section of that site. You can also go directly to the constantly updated page of recent additions in Computation and Language.
Here are some recent papers that I've found interesting for my work in bionlp and that may also be of general interest.
A Decision Tree of Bigrams is an Accurate Predictor of Word Sense
by Ted Pedersen
Abstract: This paper presents a corpus-based approach to word sense disambiguation where a decision tree assigns a sense to an ambiguous word based on the bigrams that occur nearby. This approach is evaluated using the sense-tagged corpora from the 1998 SENSEVAL word sense disambiguation exercise. It is more accurate than the average results reported for 30 of 36 words, and is more accurate than the best results for 19 of 36 words.
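To make the idea concrete, here is a toy sketch of assigning a sense from nearby bigrams. The mini-corpus, the senses, and the depth-1 "tree" (a single decision stump) are all invented for illustration; Pedersen's system induces full decision trees from the SENSEVAL sense-tagged data.

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

# Tiny invented sense-tagged corpus for the ambiguous word "bank".
training = [
    ("deposit money in the bank account".split(), "FINANCE"),
    ("the bank raised interest rates".split(), "FINANCE"),
    ("fishing on the river bank today".split(), "RIVER"),
    ("the muddy river bank flooded".split(), "RIVER"),
]

def best_stump(data):
    """Pick the single bigram that best splits the senses (a depth-1 'tree')."""
    best = None
    for tokens, _ in data:
        for bg in bigrams(tokens):
            with_bg = [s for t, s in data if bg in bigrams(t)]
            without = [s for t, s in data if bg not in bigrams(t)]
            # examples correctly classified by majority vote on each branch
            score = sum(c.most_common(1)[0][1]
                        for c in (Counter(with_bg), Counter(without)) if c)
            if best is None or score > best[1]:
                best = (bg, score,
                        Counter(with_bg).most_common(1)[0][0] if with_bg else None,
                        Counter(without).most_common(1)[0][0] if without else None)
    return best

bg, _, if_present, if_absent = best_stump(training)

def classify(tokens):
    """Assign a sense by testing whether the learned bigram occurs nearby."""
    return if_present if bg in bigrams(tokens) else if_absent

print(bg, classify("sat by the river bank".split()))
```

A real decision tree would recursively split on further bigrams; even so, the stump already captures the paper's point that collocational bigrams near the target word are strong sense cues.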
I find memory-based language processing (MBLP) approaches particularly appropriate for the biology literature, which contains many "standard" phrases repeated over and over. An excellent introduction to the field is Walter Daelemans' introduction to a special issue on the topic: Memory-Based Language Processing. Introduction to the Special Issue. In: Journal of Experimental and Theoretical AI (JETAI), 11:3, 1999. (Preprint). Here is a cached PDF version of his posted Postscript file.
Here's an example of the MBLP approach, applied to shallow parsing.
(Shallow parsing does not attempt to decide on all attachments, conjunctive structures, and other larger structural aspects of sentences.)
Memory-Based Shallow Parsing
by Walter Daelemans, Sabine Buchholz, and Jorn Veenstra
Abstract: We present a memory-based learning (MBL) approach to shallow parsing in which POS tagging, chunking, and identification of syntactic relations are formulated as memory-based modules. The experiments reported in this paper show competitive results; the F-values for the Wall Street Journal (WSJ) treebank are 93.8% for NP chunking, 94.7% for VP chunking, 77.1% for subject detection, and 79.0% for object detection.
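The memory-based idea behind this work can be sketched as nearest-neighbour classification over stored training instances. The POS sequences, chunk tags, and three-tag window below are invented for illustration; real MBL systems such as TiMBL use much richer features and information-gain feature weighting.

```python
# Memory-based (k-NN) chunk tagging in miniature: store every training
# instance, then tag new positions by their most similar stored instance.

def features(pos_tags, i):
    """Context window of POS tags around position i (edge-padded)."""
    pad = ["_"] + pos_tags + ["_"]
    return (pad[i], pad[i + 1], pad[i + 2])  # left, focus, right

# The "memory": stored instances with their chunk tags (IOB style).
memory = []
train_sents = [
    (["DT", "NN", "VBD", "DT", "JJ", "NN"],
     ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP", "I-NP"]),
    (["PRP", "VBZ", "NN"],
     ["B-NP", "B-VP", "B-NP"]),
]
for pos, chunks in train_sents:
    for i, tag in enumerate(chunks):
        memory.append((features(pos, i), tag))

def classify(feat):
    """1-nearest neighbour under the simple overlap (matching-feature) metric."""
    def overlap(a, b):
        return sum(x == y for x, y in zip(a, b))
    return max(memory, key=lambda inst: overlap(inst[0], feat))[1]

test_pos = ["DT", "JJ", "NN"]
print([classify(features(test_pos, i)) for i in range(3)])
```

Because classification is just lookup against stored examples, "standard" phrases that recur verbatim in a literature are tagged exactly as they were seen before, which is why this family of methods appeals to me for biology text.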
Another example of shallow parsing is
A Learning Approach to Shallow Parsing
by Marcia Muñoz, Vasin Punyakanok, Dan Roth, Dav Zimak
from Proceedings of EMNLP-VLC'99, pages 168-178.
Abstract: A SNoW based learning approach to shallow parsing tasks is presented and studied experimentally. The approach learns to identify syntactic patterns by combining simple predictors to produce a coherent inference. Two instantiations of this approach are studied and experimental results for Noun-Phrases (NP) and Subject-Verb (SV) phrases that compare favorably with the best published results are presented. In doing that, we compare two ways of modeling the problem of learning to recognize patterns and suggest that shallow parsing patterns are better learned using open/close predictors than using inside/outside predictors.
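The contrast the abstract draws between inside/outside and open/close representations can be made concrete by converting per-word IOB tags into bracket spans. This small sketch only illustrates the two encodings of the same chunks; it does not model the SNoW learning architecture itself.

```python
def iob_to_brackets(tags):
    """Convert inside/outside (IOB) tags to open/close bracket spans."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "B":                      # open a new phrase
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif t == "O" and start is not None:
            spans.append((start, i - 1))  # close the current phrase
            start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans

# "The big dog barked" as NP chunking: words 0-2 form one phrase.
print(iob_to_brackets(["B", "I", "I", "O"]))  # → [(0, 2)]
```

In the open/close formulation, separate predictors decide where a bracket opens and where it closes; the paper's finding is that learning those two boundary decisions works better for shallow parsing than learning per-word inside/outside tags.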
Finding sentence boundaries is important. A frequently cited paper on the topic is this one:
A Maximum Entropy Approach to Identifying Sentence Boundaries
by Jeffrey C. Reynar and Adwait Ratnaparkhi, from the 5th ANLP Conference, 1997.
Abstract: We present a trainable model for identifying sentence boundaries in raw text. Given a corpus annotated with sentence boundaries, our model learns to classify each occurrence of ., ?, and ! as either a valid or invalid sentence boundary. The training procedure requires no hand-crafted rules, lexica, part-of-speech tags, or domain-specific information. The model can therefore be trained easily on any genre of English, and should be trainable on any other Roman-alphabet language. Performance is comparable to or better than the performance of similar systems, but we emphasize the simplicity of retraining for new domains.
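As a rough illustration of the task setup, the sketch below extracts every ., ? and ! as a candidate and classifies it from local context. A hand-written rule and an invented abbreviation list stand in for the trained maximum-entropy model, which would instead weigh many such contextual features learned from an annotated corpus.

```python
import re

# Invented abbreviation list; the paper's model learns such cues from data.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "etc.", "e.g.", "i.e."}

def sentence_boundaries(text):
    """Return character offsets of ., ? or ! judged to end a sentence."""
    boundaries = []
    for m in re.finditer(r"[.?!]", text):
        i = m.start()
        left = text[:i + 1].split()[-1]   # token containing the punctuation
        rest = text[i + 1:].lstrip()
        if left in ABBREVIATIONS:
            continue                      # e.g. the period in "Dr." is no boundary
        if rest == "" or rest[:1].isupper():
            boundaries.append(i)
    return boundaries

print(sentence_boundaries("Dr. Smith arrived late. He apologized."))
```

A categorical rule like this fails on cases such as an abbreviation that really does end a sentence; the point of the maximum-entropy approach is precisely to combine such overlapping, unreliable cues probabilistically rather than absolutely.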
The NEC ResearchIndex is a rich index of the Computer Science literature, with extensive, automatically updated lists of citing documents, full-text search, downloadable full documents when available, and more. For an introduction to the site's many features, see this page. To start searching for papers, go to this page.
An example of a useful entry in the ResearchIndex is Eric Brill's highly cited 1995 paper on part-of-speech tagging. He also has a chapter in Dale's recent book: Brill, E. 2000. Part-of-Speech Tagging, pp. 403-414. In R. Dale, H. Moisl, and H. Somers (eds.), Handbook of Natural Language Processing. Marcel Dekker, New York. (This book is mentioned on this site's Literature page.)
Abstract of the 1995 paper: Recently, there has been a rebirth of empiricism in the field of natural language processing. Manual encoding of linguistic information is being challenged by automated corpus-based learning as a method of providing a natural language processing system with linguistic knowledge. Although corpus-based approaches have been successful in many different areas of natural language processing, it is often the case that these methods capture the linguistic information they are modelling indirectly, in large opaque tables of statistics. This can make it difficult to analyze, understand and improve the ability of these approaches to model underlying linguistic behavior. In this paper we will describe a simple rule-based approach to automated learning of linguistic knowledge. This approach has been shown for a number of tasks to capture information in a clearer and more direct fashion without a compromise in performance. We present a detailed case study of this learning method applied to part of speech tagging.
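The rule-based learning method the abstract describes, transformation-based learning, can be sketched in miniature: start from a most-frequent-tag baseline, then greedily pick the transformation that most reduces errors on the training corpus. The toy corpus and the single rule template below ("change tag X to Y when the previous tag is Z") are invented for illustration; Brill's tagger uses many templates and iterates until no rule helps.

```python
from collections import Counter, defaultdict

# Toy tagged corpus: the ambiguous word "can" is usually a NOUN here.
corpus = [
    [("the", "DET"), ("can", "NOUN"), ("rusted", "VERB")],
    [("we", "PRON"), ("can", "VERB"), ("go", "VERB")],
    [("the", "DET"), ("can", "NOUN"), ("fell", "VERB")],
]

# Baseline: tag every word with its most frequent tag in the corpus.
freq = defaultdict(Counter)
for sent in corpus:
    for word, tag in sent:
        freq[word][tag] += 1
baseline = {w: c.most_common(1)[0][0] for w, c in freq.items()}

def apply_rules(tags, rules):
    """Apply each (from_tag, to_tag, prev_tag) rule in learned order."""
    for frm, to, prev in rules:
        tags = [to if t == frm and i > 0 and tags[i - 1] == prev else t
                for i, t in enumerate(tags)]
    return tags

def errors(rules):
    n = 0
    for sent in corpus:
        pred = apply_rules([baseline[w] for w, _ in sent], rules)
        n += sum(p != g for p, (_, g) in zip(pred, sent))
    return n

def learn_one_rule():
    """Greedily pick the transformation that most reduces tagging errors."""
    candidates = set()
    for sent in corpus:
        pred = [baseline[w] for w, _ in sent]
        for i in range(1, len(sent)):
            if pred[i] != sent[i][1]:
                candidates.add((pred[i], sent[i][1], pred[i - 1]))
    best = min(candidates, key=lambda r: errors([r]), default=None)
    return best if best is not None and errors([best]) < errors([]) else None

print(learn_one_rule())
```

The learned rule is a readable statement about linguistic behavior ("retag NOUN as VERB after a pronoun") rather than an opaque table of statistics, which is exactly the advantage the abstract claims for the approach.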