A (typical) challenge for natural language processing of biology text

This page contains a brief excerpt from the following article:

Selective binding of perfringolysin O derivative to cholesterol-rich membrane microdomains (rafts)

A. A. Waheed, Yukiko Shimada, Harry F. G. Heijnen, Megumi Nakamura, Mitsushi Inomata, Masami Hayashi, Shintaro Iwashita, Jan W. Slot, Yoshiko Ohno-Iwashita

which appeared in Proc. Natl. Acad. Sci. USA, Vol. 98, Issue 9, 4926-4931, April 24, 2001. Here is a link to the full paper.

Here is a representative paragraph from this fairly typical paper. Analyzing the content of such text and extracting information from it is the kind of task that challenges researchers in the field of BioNLP.

"To exclude the possibility of potential detergent-induced artifacts, modification of rafts, and/or redistribution of BCtheta during extraction (1), we also isolated rafts without detergent treatment. After incubation with BCtheta , the platelets were sonicated vigorously and floated into a sucrose gradient. Again BCtheta distribution is enriched in FLDF (Fig. 2a, -Tx, fractions 4-6). The FLDF obtained without Tx detergent extraction were also enriched in cholesterol and sphingomyelin (Fig. 2c and b). To exclude the possibility that free BCtheta was liberated by sonication or with time, as a result of equilibration, had contaminated the FLDF, we carried out sucrose gradient sedimentation with free BCtheta placed at the bottom of the centrifuge tube. BCtheta was detected only in bottom fractions and not in FLDF (data not shown). BCtheta detected in fractions 10 and 11 (Fig. 2a) probably represents a toxin that is liberated because of extensive sonication. To analyze whether the Tx-insoluble material is exclusively enriched in FLDF, the pellet obtained after extraction with Tx at 4°C was mildly sonicated and fractionated on a sucrose gradient. Fig. 2d shows that more than 86% of cholesterol was distributed in FLDF (fractions 3-5), indicating that the Tx-insoluble material is enriched in FLDF in terms of cholesterol distribution. We therefore suggest that BCtheta recovered in the Tx-insoluble fractions of erythrocytes, MOLT-4, and A431 cells (Fig. 1b) represents the portion bound to rafts. This is confirmed by our observations with erythrocytes, where BCtheta distributed predominantly in FLDF (data not shown) when the isolated Tx-insoluble pellet of BCtheta-bound erythrocytes (Fig. 1b, pellet at 4°C) was fractionated on the sucrose density gradient."


This paragraph describes a portion of their experiments in which a possible artifact was examined by repeating a procedure under contrasting conditions (first three sentences). A test to exclude another possibility follows, and so forth.

This leads them to a conclusion near the end, which they then confirm with further observations of their own (last sentence).

Notice the constant interplay between the text discussion and the figures, in which the data is presented (see the original paper for the figures). It's not easy to properly analyze research papers in this field without also dealing with the figures. This surely raises the stakes, but it is the reality of the way life sciences research is described in publications. Their Figure 1 even has a data graph embedded inside another data graph, presumably to save space -- a bit of a desperate move, but the figure is understandable. This dependence on figures is the reason that a good part of my research is devoted to diagram understanding. For my work on diagrams, see my CV for diagram papers in general, online copies of some of my recent diagram papers, my diagram demo site and my current funding for diagram research.
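To make the interplay between text and figures concrete, here is a minimal Python sketch that pulls the figure citations out of a few sentences of the excerpt above. The regular expression and the function name `figure_references` are illustrative assumptions, not part of any existing BioNLP tool, and the pattern only covers the citation styles seen in this one paragraph ("Fig. 2a", "Fig. 2c and b").

```python
import re

# A few sentences copied from the excerpt above (the last one is shortened).
EXCERPT = (
    "Again BCtheta distribution is enriched in FLDF (Fig. 2a, -Tx, fractions 4-6). "
    "The FLDF obtained without Tx detergent extraction were also enriched in "
    "cholesterol and sphingomyelin (Fig. 2c and b). "
    "BCtheta was detected only in bottom fractions and not in FLDF (data not shown). "
    "Fig. 2d shows that more than 86% of cholesterol was distributed in FLDF."
)

# Hypothetical pattern: "Fig." + figure number + optional panel letter,
# optionally followed by "and <panel letter>" as in "Fig. 2c and b".
FIG_REF = re.compile(r"Fig\.\s*(\d+)([a-z])?\b(?:\s+and\s+([a-z])\b)?")


def figure_references(text):
    """Return the figure panels cited in text, in order of appearance."""
    refs = []
    for match in FIG_REF.finditer(text):
        number, first_panel, second_panel = match.groups()
        if first_panel is None:
            # A whole-figure citation such as "Fig. 2".
            refs.append(number)
            continue
        refs.append(number + first_panel)
        if second_panel is not None:
            # "Fig. 2c and b" cites panel b of the same figure.
            refs.append(number + second_panel)
    return refs


if __name__ == "__main__":
    print(figure_references(EXCERPT))
    # Claims that cite no figure at all are just as hard for a reader to check.
    print(EXCERPT.count("data not shown"))
```

Finding the citations is the easy part. Deciding what each cited panel actually shows, and whether it supports the sentence that cites it, requires understanding the figure itself.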

There certainly is a lot more to BioNLP than term extraction, a task that is occupying a lot of people's attention just now.

-- Bob Futrelle, November 2002