The digitization of knowledge and concerted retrospective scanning projects are making significant amounts of data, historical data in particular, increasingly available to readers and researchers in many disciplines. To make this data useful, our group is working on improving OCR, language modeling, multiple-version alignment, syntactic analysis, information extraction, and information retrieval. I will focus on the problem of inferring the relational structure latent in large collections of documents such as books, web pages, patent applications, grant proposals, and social media postings: which books or passages quote, translate, paraphrase, or cite one another? This research requires better models of translation and other forms of similarity, as well as more efficient methods for comparing large numbers of passages. Finally, I will discuss how similarity relations can be used to improve classification tasks.
David Smith is a Research Assistant Professor in the Computer Science Department at the University of Massachusetts, Amherst, where he conducts research on natural language processing, computational linguistics, information retrieval, digital libraries, and machine translation. He holds a Ph.D. in Computer Science from the Johns Hopkins University. Before graduate school, he was the head programmer for the Perseus Digital Library Project and received an A.B. in Classics from Harvard.