
Here are a few interesting projects that I've worked on (or am currently working on),
other than those visible on the Relational Agents Group page.
This page is under construction.
- Annotated Congressional Record and Ideology Detection Engine
- This is a work in progress, in two parts. The first part is nearly complete: I have parsed ten years
of the Congressional Record, which is currently available only in human-readable format, extracted
each statement made by a member of Congress, and built a large SQL database of floor statements,
each associated with the senator or representative who spoke. Each senator and representative
is annotated in the database with CommonSpace scores and other data.
- The second part uses this data to train a classifier, based on a bigram/trigram language model, to
recognize terms and phrases that are significantly more likely to be uttered by a liberal than by a
conservative, and vice versa. The full project proposal can be viewed here.
- The full database, in SQL dump format, can be downloaded here [.tar.gz, 232MB]. Be sure to read the included README file. Note
that the data is somewhat noisy, as most of it has been automatically parsed from an imperfect source: there
are certainly some typos, there may be a few misattributed statements, and some statements are missing.
I will document the shortcomings of the data at some point.
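To make the classifier idea in the second part concrete, here is a minimal sketch in Python. It is not the project's actual implementation: the function names, the toy statements, and the use of add-one-smoothed log-odds in place of a proper significance test are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(statements, orders=(2, 3)):
    """Count bigrams and trigrams over a list of floor statements."""
    counts = Counter()
    for text in statements:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        for n in orders:
            counts.update(ngrams(tokens, n))
    return counts


def log_odds(liberal_counts, conservative_counts, alpha=1.0):
    """Smoothed log-odds per n-gram: positive leans liberal, negative conservative."""
    vocab = set(liberal_counts) | set(conservative_counts)
    lib_total = sum(liberal_counts.values()) + alpha * len(vocab)
    con_total = sum(conservative_counts.values()) + alpha * len(vocab)
    return {
        gram: math.log((liberal_counts[gram] + alpha) / lib_total)
        - math.log((conservative_counts[gram] + alpha) / con_total)
        for gram in vocab
    }


def classify(text, scores, orders=(2, 3)):
    """Sum the scores of known n-grams; a total of exactly zero falls to 'conservative'."""
    tokens = text.lower().split()
    total = sum(scores.get(g, 0.0) for n in orders for g in ngrams(tokens, n))
    return "liberal" if total > 0 else "conservative"


# Toy, made-up statements standing in for rows of the database (hypothetical data).
liberal = ["we must protect working families", "protect working families now"]
conservative = ["we must cut the death tax", "cut the death tax now"]

scores = log_odds(count_ngrams(liberal), count_ngrams(conservative))
```

With real data, the two lists would be filled from the floor statements of members on either side of some CommonSpace threshold, and the highest- and lowest-scoring n-grams would be the candidate partisan phrases.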
- Synchronous, Probabilistic, Lexicalized Tree Insertion Grammars for Machine Translation
- I worked on this project while a Special Student at Harvard. It began
as part of a seminar on natural language processing. I was part of
the initial discussions and took part in the design and implementation of
an early version of the system, as detailed in this report.
- The work begun in that seminar has since been significantly furthered, primarily by
Stuart Shieber, Rebecca Nesson, and Alexander Rush. These researchers recently published
a long paper on this work at AMTA 2006.