
A brief introduction to NLP

Updated 12 February 2006


Basic NLP

Natural language processing, also called computational linguistics or natural language understanding, attempts to use automated means to process text and deduce its syntactic and semantic structure. This is done for many purposes: to understand the nature of language, to extract specific information from text, to perform machine translation, to produce automated summaries, and so on.

The two primary aspects of natural language on which work focuses (there are others) are syntax and the lexicon. Syntax, the set of patterns of the language, defines structures such as the sentence (S), made up of noun phrases (NPs) and verb phrases (VPs). These structures include a variety of modifiers such as adjectives, adverbs and prepositional phrases. The syntactic structure of a sentence is determined by a parser. At the bottom of all this are the words, and information about these is kept in a lexicon, a machine-readable dictionary that may contain a good deal of additional information about the properties of the words, notated in a form that parsers can use.
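To make the S/NP/VP picture concrete, here is a tiny sketch in Python using the NLTK toolkit. The grammar rules, the miniature lexicon and the example sentence are my own simplified assumptions rather than anything from a real system; a practical grammar and lexicon would be vastly larger.

    # A toy grammar plus lexicon, parsed with NLTK's chart parser.
    # (Assumes the nltk package is installed; the rules are illustrative only.)
    import nltk

    grammar = nltk.CFG.fromstring("""
        S   -> NP VP
        NP  -> Det N | Adj N | N
        VP  -> V PP | V NP
        PP  -> P NP
        Det -> 'the'
        Adj -> 'voltage-gated'
        N   -> 'channels' | 'neurons'
        V   -> 'open'
        P   -> 'in'
    """)

    parser = nltk.ChartParser(grammar)
    sentence = ['the', 'channels', 'open', 'in', 'neurons']
    for tree in parser.parse(sentence):
        print(tree)   # prints a bracketed tree: (S (NP ...) (VP (V open) (PP ...)))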

It is now possible to parse text reasonably well, determining the syntactic structure of a sentence. Unfortunately, in any real sentence there are notorious problems of ambiguity. These are usually so effortlessly resolved by a human reader that it's sometimes difficult to appreciate just how thoroughly they can confound the parsing process. Consider the simple sentence,

"Voltage-gated sodium and potassium channels are involved in the generation of action potentials in neurons." (Science, v219, p1337, Human Genome issue).

To a biologist, this sentence is clear and unambiguous. A parser, however, faces many difficulties in analyzing it. For example, a parser may group the constituents to form "Voltage-gated sodium", when it is the channels that are voltage-gated. There is also the question of whether there are single channels for both sodium and potassium or separate channels for the two ions -- the structure of the English in the sentence leaves this open. These ambiguities are classic ones in parsing, and there are no simple ways to resolve them on the basis of sentence syntax alone.
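This coordination ambiguity can be made concrete with a toy grammar. The sketch below, again in Python with NLTK, is my own illustrative assumption (the grammar is not from any real system); even these few rules give the phrase two distinct bracketings, corresponding exactly to the two readings described above.

    # A toy grammar in which "and" can join either bare nouns or whole noun phrases.
    # (Illustrative assumption only; assumes nltk is installed.)
    import nltk

    grammar = nltk.CFG.fromstring("""
        NP   -> NP Conj NP | Adj NP | N NP | N
        Conj -> 'and'
        Adj  -> 'voltage-gated'
        N    -> 'sodium' | 'potassium' | 'channels'
    """)

    parser = nltk.ChartParser(grammar)
    tokens = ['voltage-gated', 'sodium', 'and', 'potassium', 'channels']
    trees = list(parser.parse(tokens))
    print(len(trees), 'parses')   # 2: [[voltage-gated sodium] and [potassium channels]]
    for tree in trees:            #    vs. [voltage-gated [sodium and [potassium channels]]]
        print(tree)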

At the semantic level, the specialized two-word term "action potentials" has a meaning that cannot be determined compositionally; that is, it cannot reasonably be determined by looking in a lexicon for the separate and independent meanings of the words "action" and "potential" and combining those meanings. The simple examples that linguists use to drive home this point are a "mailman" who is not made out of mail and a "snowman" who does not deliver snow. On the other hand, the meaning of a term such as "adaptor molecule" can be built up with reasonable accuracy from the meanings of its two components.
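One common engineering response, offered here only as my own sketch of the idea, is to give such multi-word terms their own entries in the lexicon and to fall back on composing word meanings only when no entry exists. The entries and glosses below are made up for illustration.

    # Prefer a whole-term lexicon entry; compose individual word senses only as a fallback.
    # All entries and glosses are illustrative assumptions.
    MULTIWORD_LEXICON = {
        ('action', 'potential'): 'a rapid, self-propagating change in membrane voltage',
        ('snow', 'man'): 'a figure made of snow',
    }

    WORD_SENSES = {
        'adaptor': 'a thing that couples two other things',
        'molecule': 'a chemical entity',
    }

    def gloss(words):
        """Return a whole-term gloss if one exists, else a crude compositional guess."""
        entry = MULTIWORD_LEXICON.get(tuple(words))
        if entry is not None:
            return entry   # non-compositional: looked up as a unit
        return ' + '.join(WORD_SENSES.get(w, '?') for w in words)   # built from the parts

    print(gloss(['action', 'potential']))    # lexicon entry, not "action" combined with "potential"
    print(gloss(['adaptor', 'molecule']))    # composed from its two components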

This all may seem a bit foolish, but a parser approaches these problems in literal and narrow ways compared to an experienced human "parser", who understands the domain of the utterance (sentence, paragraph, etc.), its context, and, from these, which interpretation of the sentence is most probably correct.

Beyond the level of syntactic parsing there are further problems of meaning -- the problem of semantics. A syntactic analysis that discovers that the subject of a sentence is a noun (which happens to designate a protein) and that its verb is "degrade" has not actually constructed a representation of the meaning of the sentence: a process that occurs over time and results from specific cleavage, or from instability due to the medium, the temperature, or other causes and mechanisms. Semantic analysis is a difficult problem, but it is an important part of natural language analysis because it is what is needed to resolve the many ambiguities that remain after syntactic analysis.
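To suggest what such a representation might look like, here is a bare-bones sketch in Python. The event frame, its slot names and the filled-in values are all my own assumptions about the kind of structure a semantic analyzer would need to build; they are not the output of any actual system.

    # An "event frame" for a degradation process: the sort of structure a semantic
    # analysis must construct beyond the parse tree. Slots and values are illustrative.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DegradationEvent:
        substrate: str                   # the entity being degraded (here, a protein)
        cause: Optional[str] = None      # e.g. 'specific cleavage', 'temperature'
        mechanism: Optional[str] = None
        occurs_over_time: bool = True    # degradation is a process, not an instant

    # The parse tree only says that some noun is the subject of "degrade";
    # the semantic step has to instantiate something like this:
    event = DegradationEvent(substrate='protein X', cause='specific cleavage')
    print(event)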

Simpler Approaches -- Pattern Recognition

The preceding description sounds like a tall order, and it certainly can be. Fortunately, there is a wide variety of simpler approaches that are adequate for many purposes. The most common, certainly in the biological and bioinformatics community, is to apply the same tools used for sequence analysis to linguistic analysis. Many people use Perl for finding patterns in sequences and are now using it to find patterns in text in just the same way. For some purposes this is perfectly adequate; when it is not, there are ways to go beyond these techniques. (I should mention that the patterns of English are provably more complex than those that can be expressed with Perl's regular expressions -- English is not a "regular language" in the technical sense. Nevertheless, regular expressions are adequate for a lot of uses.)
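As a concrete, and deliberately crude, illustration of this style of work, here is a small pattern match written with Python's re module; the same idea carries over directly to Perl's regular expressions. The pattern itself is an assumption for the sake of illustration, not a validated extraction rule.

    # Pull out phrases ending in "channels" with a regular expression.
    # The pattern is a rough illustrative guess, not a validated rule.
    import re

    text = ("Voltage-gated sodium and potassium channels are involved in the "
            "generation of action potentials in neurons.")

    # Up to four words (hyphens allowed) followed by the head word "channels".
    pattern = re.compile(r'\b((?:[\w-]+\s+){1,4}channels)\b')
    for match in pattern.finditer(text):
        print(match.group(1))   # Voltage-gated sodium and potassium channels

A pattern like this will happily match strings it should not and miss legitimate variants, which is exactly the gap that the fuller syntactic and semantic machinery described above tries to close.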

Manual construction of lexicons and grammars is a huge and tedious process, so many techniques have been developed to avoid this manual work. For example, starting with a modest amount of manually parsed text, a parser can be "trained" by constructing rules that match the manually produced structures. This is a machine learning approach. Other analyses use massive amounts of text to look for certain regularities; this is the statistical approach to NLP. It has become an important area over the last decade with the increasing availability of large on-line corpora.
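As a toy illustration of the statistical flavor, and again only my own sketch rather than a method from this page, the fragment below "trains" the simplest possible part-of-speech tagger by counting how often each word carries each tag in a tiny hand-tagged sample. Real systems do the same kind of counting over millions of words of annotated or raw text.

    # Count word/tag frequencies in a tiny tagged sample, then tag by majority vote.
    # The sample and tag set are illustrative assumptions.
    from collections import Counter, defaultdict

    tagged_sample = [
        ('channels', 'NOUN'), ('are', 'VERB'), ('involved', 'VERB'),
        ('in', 'PREP'), ('neurons', 'NOUN'), ('channels', 'NOUN'),
        ('potentials', 'NOUN'), ('generation', 'NOUN'),
    ]

    counts = defaultdict(Counter)
    for word, tag in tagged_sample:
        counts[word][tag] += 1

    def most_likely_tag(word):
        """Return the most frequently observed tag for the word, or UNKNOWN."""
        if word in counts:
            return counts[word].most_common(1)[0][0]
        return 'UNKNOWN'

    print([(w, most_likely_tag(w)) for w in ['channels', 'in', 'kinase']])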

Here's another short introduction to the field, an FAQ from Zurich. There's a useful one-page summary at the ACL site with links to the full first chapters of two excellent books (total: 56 pages).