Information Extraction

From May 2009 to August 2009, we worked on a research project on extracting medication-related information from patient health records. We were given a set of training records, together with ground-truth annotations for 10 of them. Suppose our goal is to interpret the following text:
``Patient was taking fluocinonide $ 0.5\%$ cream 1 bag p.o. from Jan 12 to May 15 this year X 3 q.d. until ready for d/c home. Before this, the patient had a 50-point hematocrit drop.''
Suppose this piece of text occupies the $ 37^{th}$ and $ 38^{th}$ lines of a patient record. The ground truth for this example is then as follows.
m=``fluocinonide $ 0.5\%$ cream'' 37:3 37:5 $ \vert\vert$ do=``1 bag'' 37:6 37:7 $ \vert\vert$ mo=``p.o.'' 37:8 37:8 $ \vert\vert$ f=``X 3 q.d.'' 37:17 38:0 $ \vert\vert$ du=``from Jan 12 to May 15 this year ... until ready for d/c home'' 37:9 37:16 38:1 38:5 $ \vert\vert$ r=``50-point hematocrit drop'' 38:12 38:14 $ \vert\vert$ ln=``narrative''
Here ``m'' means ``medication name'', ``do'' means ``dosage'', ``mo'' means ``mode for the medication'', ``f'' means ``frequency to take the medication'', ``du'' means ``duration'', ``r'' means ``reason to take the medication'', and ``ln'' indicates whether the information for this medication comes from a narrative or a list. The text within double quotes is the content of a specific field, and the offset ``37:3 37:5'' following ``fluocinonide $ 0.5\%$ cream'' means that the field lies in line 37, from column 3 through column 5. As the example shows, columns count whitespace-separated words rather than characters. A field may carry more than one offset pair when its content is discontinuous, as ``du'' does here. If a field is not found, it is recorded as ``nm'' (not mentioned). Line numbers start at one and column numbers start at zero.
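The annotation format described above can be read mechanically. The following is a minimal sketch, not the code we used in the project: it assumes a plain-text form of an entry with straight double quotes and ``||'' separators, and the names parse_entry and span_text are hypothetical. The line break between lines 37 and 38 in the usage example is inferred from the offsets.

```python
import re

# Assumed plain-text form of one annotation entry: straight double quotes and
# "||" separators stand in for the typeset quotes and bars shown in the text.
FIELD_RE = re.compile(r'^(\w+)="(.*)"((?:\s+\d+:\d+)*)$')


def parse_entry(entry):
    """Return {field: (text, [(start, end), ...])}; offsets are (line, column) pairs."""
    fields = {}
    for part in entry.split("||"):
        match = FIELD_RE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized field: {part!r}")
        name, text, raw_offsets = match.groups()
        points = [tuple(map(int, p.split(":"))) for p in raw_offsets.split()]
        # Offsets come in start/end pairs; a discontinuous field has several pairs.
        fields[name] = (text, list(zip(points[0::2], points[1::2])))
    return fields


def span_text(record_lines, start, end):
    """Join the whitespace-separated words from start to end, both inclusive.

    record_lines maps a one-based line number to that line's text;
    columns are zero-based word indices within a line.
    """
    (l0, t0), (l1, t1) = start, end
    words = []
    for line_no in range(l0, l1 + 1):
        tokens = record_lines[line_no].split()
        lo = t0 if line_no == l0 else 0
        hi = t1 if line_no == l1 else len(tokens) - 1
        words.extend(tokens[lo:hi + 1])
    return " ".join(words)


if __name__ == "__main__":
    record = {
        37: "Patient was taking fluocinonide 0.5% cream 1 bag p.o. "
            "from Jan 12 to May 15 this year X 3",
        38: "q.d. until ready for d/c home. Before this, the patient "
            "had a 50-point hematocrit drop.",
    }
    parsed = parse_entry('m="fluocinonide 0.5% cream" 37:3 37:5||f="X 3 q.d." 37:17 38:0')
    for name, (text, spans) in parsed.items():
        for start, end in spans:
            print(name, repr(text), "->", repr(span_text(record, start, end)))
```

Resolving an offset pair back to the record text in this way gives a simple consistency check between the quoted field content and its stated position.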

The challenges here are:

To address the first problem, we used a combination of the Orange Book and RxNorm databases, which together cover $ 87\%$ of the medication names in the ground truth. Most of the medication names in the ground truth files that were not found were misspellings, abbreviations, or generic terms such as ``antibiotics'' or ``pain medication''.
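A coverage figure of this kind can be computed by normalizing names and testing membership in the combined lexicon. The sketch below only illustrates that calculation: the function names are hypothetical, the normalization (lowercasing and collapsing whitespace) is an assumption, and the toy lists in the usage example are not taken from the Orange Book or RxNorm.

```python
def normalize(name):
    """Lowercase a name and collapse runs of whitespace (assumed normalization)."""
    return " ".join(name.lower().split())


def coverage(gold_names, lexicon):
    """Return (fraction of gold names found in the lexicon, list of misses)."""
    vocab = {normalize(n) for n in lexicon}
    misses = [n for n in gold_names if normalize(n) not in vocab]
    found = len(gold_names) - len(misses)
    rate = found / len(gold_names) if gold_names else 0.0
    return rate, misses


if __name__ == "__main__":
    # Toy data for illustration only.
    toy_lexicon = ["Fluocinonide", "Aspirin", "Metoprolol"]
    toy_gold = ["fluocinonide", "aspirin", "antibiotics", "metoprolol"]
    print(coverage(toy_gold, toy_lexicon))
```

Returning the misses alongside the rate makes it easy to inspect which names fall outside the lexicon, for example misspellings or generic terms.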

Because the number of ground truth files provided was very limited, we decided to annotate the remaining records manually, five records per person per week. These turned out to be extremely difficult to interpret by hand: we spent around three weeks finishing the first five records, with hundreds still remaining. This experience is one of the main reasons that I do research on the evaluation of IR systems where judgments are incomplete. During this period, I wrote a Python program that colors the content of the different fields in a medical record according to its ground truth data. This program helped us examine how well our system worked.
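The idea behind such a coloring program can be sketched in a few lines. This is an illustrative reconstruction, not the original program: it assumes terminal output with ANSI escape codes, an arbitrary color per field, and word-level offsets in the (line, column) form of the ground truth; the name colorize is hypothetical.

```python
# Arbitrary ANSI colors per field (an assumption for this sketch).
ANSI = {
    "m": "\033[31m", "do": "\033[32m", "mo": "\033[33m",
    "f": "\033[34m", "du": "\033[35m", "r": "\033[36m",
}
RESET = "\033[0m"


def colorize(record_lines, annotations):
    """Wrap every annotated word in its field color.

    record_lines maps a one-based line number to text; annotations is a list
    of (field, (start_line, start_col), (end_line, end_col)), ends inclusive.
    """
    color_at = {}
    for field, (l0, t0), (l1, t1) in annotations:
        for line_no in range(l0, l1 + 1):
            n_tokens = len(record_lines[line_no].split())
            lo = t0 if line_no == l0 else 0
            hi = t1 if line_no == l1 else n_tokens - 1
            for col in range(lo, hi + 1):
                color_at[(line_no, col)] = ANSI[field]
    colored = {}
    for line_no, text in record_lines.items():
        words = []
        for col, word in enumerate(text.split()):
            color = color_at.get((line_no, col))
            words.append(f"{color}{word}{RESET}" if color else word)
        colored[line_no] = " ".join(words)
    return colored


if __name__ == "__main__":
    record = {37: "Patient was taking fluocinonide 0.5% cream 1 bag p.o."}
    marks = [("m", (37, 3), (37, 5)), ("do", (37, 6), (37, 7)), ("mo", (37, 8), (37, 8))]
    print(colorize(record, marks)[37])
```

Running the same routine on both the ground truth offsets and a system's output offsets gives two colored views of a record that can be compared by eye.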

We divided our task into three parts. My part was to extract a medication's frequency. Frequency expressions fall into three basic categories: frequencies proper, like ``b.i.d'' and ``X 3 daily''; expressions that mean as needed, like ``prn'' and ``as necessary''; and temporal phrases that specify when a medication should be taken, like ``after meal'' and ``at 4pm''. The categories may also be combined, as in ``x 3 a day after meal as needed''.
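The three categories can be illustrated with a few regular expressions. This is a toy sketch and not the algorithm described below: the patterns are assumptions chosen to cover the examples in the text, and they are far from a complete list of frequency expressions.

```python
import re

# Illustrative patterns only; they cover the examples in the text, not the
# full range of expressions found in real records.
PATTERNS = {
    "frequency": re.compile(
        r"\b(?:[bqt]\.?i\.?d\.?|q\.?d\.?"
        r"|x\s*\d+(?:\s+(?:daily|a\s+day|q\.?d\.?))?|daily)(?!\w)",
        re.IGNORECASE,
    ),
    "as_needed": re.compile(
        r"\b(?:p\.?r\.?n\.?|as\s+needed|as\s+necessary)(?!\w)", re.IGNORECASE
    ),
    "temporal": re.compile(
        r"\b(?:(?:after|before|with)\s+meals?"
        r"|at\s+\d{1,2}(?::\d{2})?\s*[ap]m|at\s+bedtime)(?!\w)",
        re.IGNORECASE,
    ),
}


def categorize(phrase):
    """Return the set of category names whose pattern occurs in the phrase."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(phrase)}


if __name__ == "__main__":
    for example in ["b.i.d", "X 3 daily", "prn", "as necessary",
                    "after meal", "at 4pm", "x 3 a day after meal as needed"]:
        print(f"{example!r}: {sorted(categorize(example))}")
```

Returning a set rather than a single label reflects the fact that one expression can combine several categories.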

We developed a simple algorithm, which we called the Medication Frequency Decision Algorithm. It is shown in Algorithms 1, 2, and 3.
\begin{algorithm}
% latex2html id marker 33\SetLine
Given a medication offse...
...eturn $f$\;
\caption{Medication Frequency Decision Algorithm}
\end{algorithm}


\begin{algorithm}
% latex2html id marker 40\SetLine
Given string $length$, a...
...$pos_i$\;
}
}
}
Return $f$\;
\caption{REsearch Algorithm}
\end{algorithm}


\begin{algorithm}
% latex2html id marker 60\SetLine
Given a word and a list ...
...n TRUE\;
}{
Return FALSE\;
}
\caption{UnitCheck Algorithm}
\end{algorithm}

$ C_1$, $ C_2$, and $ \alpha$ are constants denoting the left string length, the right string length, and the span length, respectively. UNITLIST contains most of the possible single-element frequency strings. All of these were obtained by manually analyzing the given ground truth data and our own extracted data. We obtained the best results with $ C_1 = 5$, $ C_2 = 20$, and $ \alpha = 2$. Although we have not received results for larger sets of testing data, our algorithm was very effective on the training set.

Wu Jiang 2009-11-05