David A. Smith

Assistant Professor, College of Computer and Information Science, Northeastern University
440 Huntington Avenue
West Village H, Room 356
Boston, MA 02115
Phone: (617) 373-8526
dasmithATSIGNccs.neu.edu

Office hours: Thursdays, 3:00-5:00, or by appointment

With colleagues in CS, Social Sciences, and Humanities, I am starting the NULab for Texts, Maps, and Networks as Northeastern's research center for digital humanities and computational social science. My research focuses on natural language processing and computational linguistics, with applications to machine translation, information retrieval, and digital humanities.

Recent news: Our NEH-funded Infectious Texts project on viral networks in 19th-century newspapers was featured in Wired. Our work analyzing text reuse in bills in the US Congress was featured in the Economist.

I'm also excited about recent grants from the Mellon Foundation, Proteus Infrastructure: Work Aggregation and Entity Extraction; and a Google Faculty Research Award, Trees over Time: Diachronic Syntactic Analysis of Millions of Books with Unsupervised Domain Adaptation. Here's a press release with an awkward picture.

Until August 2012, I was a Research Assistant Professor in the Department of Computer Science at the University of Massachusetts, Amherst. I still advise Ph.D. students there as an adjunct professor.

Formerly: Natural Language Processing at Johns Hopkins University; and Head Programmer, Perseus Project, Tufts University

See also my curriculum vitae in PDF.

Graduate Students

Jason Naradowsky

Kriste Krstovski

Xiaoye "Tiger" Wu

Shaobin Xu

Liwen Hou

Teaching

Fall 2014: Information Retrieval (CS6200): Thursdays, 6-9, Shillman 335

Spring 2014: Natural Language Processing (CS6120)

Fall 2013: Information Retrieval (CS6200)

Spring 2013: Natural Language Processing (CS6120)

Fall 2012: Information Retrieval (CS6200/IS4200)

Spring 2012: Search Engines (CS 446)

Fall 2011: Residential Academic Program First-Year Seminar (CS 191a)

Fall 2009: Introduction to Natural Language Processing (CS 585)

Spring 2009: James Allan, R. Manmatha, and I led a seminar on Mining Text and Images in Digital Libraries Using Grid Computing.

August 2006: Charles Schafer and I presented a tutorial, Overview of Statistical Machine Translation [pdf], at the Association for Machine Translation in the Americas.

Fall 2005: Noah Smith and I designed and taught a course on Empirical Research Methods in Computer Science.

Refereed Conference & Journal Publications

John Wilkerson, David A. Smith, and Nick Stramp. Tracing the flow of policy ideas in legislatures: A text reuse approach. American Journal of Political Science, 2015 (accepted). [ PDF ]

David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, Nick Stramp, and John Wilkerson. Detecting and modeling local text reuse. In Proceedings of the ACM+IEEE-CS Joint Conference on Digital Libraries, 2014. Nominated for best paper. [ PDF ]

Shaobin Xu, David Smith, Abigail Mullen, and Ryan Cordell. Detecting and evaluating local text reuse in social networks. In ACL Joint Workshop on Social Dynamics and Personal Attributes in Social Media, 2014.

David A. Smith, Ryan Cordell, and Elizabeth Maddock Dillon. Infectious texts: Modeling text reuse in nineteenth-century newspapers. In IEEE Workshop on Big Data and the Humanities, 2013. [ PDF ]

Kriste Krstovski, David A. Smith, Hanna M. Wallach, and Andrew McGregor. Efficient nearest-neighbor search in the probability simplex. In Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR), 2013. [ PDF ]

Kriste Krstovski and David A. Smith. Online polylingual topic models for fast document translation detection. In Proceedings of the Workshop on Statistical Machine Translation, 2013.

Jacqueline L. Feild, Erik G. Learned-Miller, and David A. Smith. Using a probabilistic syllable model to improve scene text recognition. In International Conference on Document Analysis and Recognition (ICDAR), 2013.

Xiaoxi Xu, Tom Murray, Beverly Park Woolf, and David A. Smith. Mining social deliberation in online communication: If you were me and I were you. In International Conference on Educational Data Mining (EDM), 2013.

Jason Naradowsky, Tim Vieira, and David A. Smith. Grammarless parsing for joint inference. In Proceedings of the International Conference on Computational Linguistics (COLING), 2012. [ PDF ]

Jason Naradowsky, Sebastian Riedel, and David A. Smith. Improving NLP through marginalization of hidden syntactic structure. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2012.

Sebastian Riedel, David A. Smith, and Andrew McCallum. Parse, price and cut: Delayed column and row generation for graph based parsers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2012.

Yanchuan Sim, Noah A. Smith, and David A. Smith. Discovering factions in the computational linguistics community. In ACL Workshop on Rediscovering 50 Years of Discoveries, 2012. [ PDF ]

Michael Bendersky and David A. Smith. A dictionary of wisdom and wit: Learning to extract quotable phrases. In NAACL Workshop on Computational Linguistics for Literature, pages 69-77, 2012. [ PDF ]

David Bamman and David A. Smith. Extracting two thousand years of Latin from a million book library. ACM Journal on Computing and Cultural Heritage, 5(1), 2012.

David A. Smith, R. Manmatha, and James Allan. Mining relational structure from millions of books: Position paper. In Proceedings of the CIKM BooksOnline Workshop, pages 49-54, 2011.

Jae-Hyun Park, W. Bruce Croft, and David A. Smith. A quasi-synchronous dependence model for information retrieval. In Conference on Information and Knowledge Management (CIKM), pages 17-26, 2011. [ PDF ]

Jinyoung Kim, W. Bruce Croft, David A. Smith, and Anton Bakalov. Evaluating an associative browsing model for personal information. In Conference on Information and Knowledge Management (CIKM), pages 647-652, 2011. [ PDF ]

Jeffrey Dalton, James Allan, and David A. Smith. Passage retrieval for incorporating global dependencies in sequence labeling. In Conference on Information and Knowledge Management (CIKM), pages 355-364, 2011. [ PDF ]

Kriste Krstovski and David A. Smith. A minimally supervised approach for detecting and ranking document translation pairs. In Proceedings of the Workshop on Statistical Machine Translation, pages 207-216, 2011. [ PDF ]

Jangwon Seo, W. Bruce Croft, and David A. Smith. Online community search using conversational structures. Information Retrieval, 14(6):547-571, 2011. [ PDF ]

Andrew Kae, David A. Smith, and Erik Learned-Miller. Learning on the fly: A font-free approach towards multilingual OCR. International Journal on Document Analysis and Recognition, 14(3):289-301, 2011. [ PDF ]

Michael Bendersky, W. Bruce Croft, and David A. Smith. Joint annotation of search queries. In Proceedings of the Association for Computational Linguistics, pages 102-111, 2011. [ PDF ]

John S. Y. Lee, Jason Naradowsky, and David A. Smith. A discriminative model for joint morphological disambiguation and dependency parsing. In Proceedings of the Association for Computational Linguistics, pages 885-894, 2011. [ PDF ]

Elif Aktolga, James Allan, and David A. Smith. Passage reranking for question answering using syntactic structures and answer types. In European Conference on Information Retrieval (ECIR), pages 617-628, 2011. [ PDF ]

Jinyoung Kim, Anton Bakalov, David A. Smith, and W. Bruce Croft. Building and evaluating a semantic representation for personal information. In Conference on Information and Knowledge Management (CIKM), pages 1741-1744, 2010.

Xiaobing Xue, W. Bruce Croft, and David A. Smith. Query reformulation using query distributions. In Conference on Information and Knowledge Management (CIKM), pages 1497-1500, 2010.

Michael Bendersky, W. Bruce Croft, and David A. Smith. Structural annotation of search queries using pseudo-relevance feedback. In Conference on Information and Knowledge Management (CIKM), pages 1537-1540, 2010. [ PDF ]

Sebastian Riedel, David A. Smith, and Andrew McCallum. Inference by minimizing size, divergence, or their sum. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 227-234, 2010. [ PDF ]

Sebastian Riedel and David A. Smith. Relaxed marginal inference and its application to dependency parsing. In Proceedings of the Conference on Human Language Technology of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 760-768, 2010. [ PDF ]

Jangwon Seo, W. Bruce Croft, and David A. Smith. Online community search using thread structure. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pages 1907-1910, 2009.

David A. Smith and Jason Eisner. Parser adaptation and projection with quasi-synchronous grammar features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 822-831, 2009. [ PDF | PowerPoint slides ]

David Mimno, Hanna Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. Polylingual topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 880-889, 2009. [ PDF ]

Michael Bendersky, W. Bruce Croft, and David A. Smith. Two-stage query segmentation for information retrieval. In Proceedings of the 32nd International ACM SIGIR Conference, pages 810-811, 2009. [ PDF ]

David A. Smith and Jason Eisner. Dependency parsing by belief propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 145-156, 2008. [ PDF | PowerPoint slides ]

Keith Hall, Jiří Havelka, and David A. Smith. Log-linear models of non-projective trees, k-best MST parsing and tree-ranking. In Proceedings of the CoNLL Shared Task, pages 962-966, 2007.

David A. Smith and Noah A. Smith. Probabilistic models of nonprojective dependency trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 132-140, 2007. [ PDF | PowerPoint slides ]

David A. Smith and Jason Eisner. Bootstrapping feature-rich dependency parsers with entropic priors. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 667-677, 2007. [ PDF | PowerPoint slides ]

David A. Smith and Jason Eisner. Minimum risk annealing for training log-linear models. In Proceedings of the International Conference on Computational Linguistics and the Association for Computational Linguistics, pages 787-794, 2006. [ PDF ]

Markus Dreyer, David A. Smith, and Noah A. Smith. Vine parsing and minimum risk reranking for speed and precision. In Proceedings of the CoNLL Shared Task, pages 201-205, 2006. [ PDF ]

David A. Smith and Jason Eisner. Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proceedings of the HLT-NAACL Workshop on Statistical Machine Translation, pages 23-30, 2006. [ PDF | PowerPoint slides ]

Noah A. Smith, David A. Smith, and Roy W. Tromble. Context-based morphological disambiguation with random fields. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 475-482, 2005. [ PDF ]

David A. Smith and Noah A. Smith. Bilingual parsing with factored estimation: Using English to parse Korean. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 49-56, 2004. [ PDF ]

F.J. Och, D. Gildea, S. Khudanpur, A. Sarkar, K. Yamada, A. Fraser, S. Kumar, L. Shen, D. Smith, K. Eng, V. Jain, Z. Jin, and D. Radev. A smorgasbord of features for statistical machine translation. In Proceedings of the Conference on Human Language Technology and the North American Association for Computational Linguistics, pages 161-168, 2004. [ PDF ]

David A. Smith and Gideon S. Mann. Bootstrapping toponym classifiers. In Proceedings of the HLT-NAACL Workshop on Analysis of Geographic References, pages 45-49, 2003. [ PDF ]

David A. Smith, Anne Mahoney, and Gregory Crane. Integrating harvesting into digital library content. In Proceedings of the 2nd ACM+IEEE Joint Conference on Digital Libraries, pages 183-184, Portland, OR, July 2002. [ PDF ]

David A. Smith. Detecting events with date and place information in unstructured text. In Proceedings of the 2nd ACM+IEEE Joint Conference on Digital Libraries, pages 191-196, Portland, OR, July 2002. [ PDF ]

David A. Smith. Detecting and browsing events in unstructured text. In Proceedings of the 25th Annual ACM SIGIR Conference, pages 73-80, Tampere, Finland, August 2002. [ PDF ]

David A. Smith and Gregory Crane. Disambiguating geographic names in a historical digital library. In Proceedings of the European Conference on Digital Libraries (ECDL), pages 127-136, Darmstadt, Germany, September 2001. [ PDF ]

David A. Smith, Anne Mahoney, and Jeffrey A. Rydberg-Cox. Management of XML documents in an integrated digital library. Markup Languages: Theory and Practice, 2(3):205-214, 2000. [ PDF ]

David A. Smith, Anne Mahoney, and Jeffrey A. Rydberg-Cox. Management of XML documents in an integrated digital library. In Proceedings of Extreme Markup Languages 2000, pages 219-224, Montreal, August 2000.

David A. Smith, Jeffrey A. Rydberg-Cox, and Gregory R. Crane. The Perseus Project: A digital library for the humanities. Literary and Linguistic Computing, 15(1):15-25, 2000.

David A. Smith. Textual variation and version control in the TEI. Computers and the Humanities, 33(1-2):103-112, 1999.

Gregory Crane, Clifford E. Wulfman, Lisa M. Cerrato, Anne Mahoney, Thomas L. Milbank, David Mimno, Jeffrey A. Rydberg-Cox, David A. Smith, and Christopher York. Towards a cultural heritage digital library. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2003, pages 75-86, Houston, TX, June 2003. [ PDF ]

Gregory R. Crane, Robert F. Chavez, Anne Mahoney, Thomas L. Milbank, Jeffrey A. Rydberg-Cox, David A. Smith, and Clifford E. Wulfman. Drudgery and deep thought: Designing a digital library for the humanities. Communications of the Association for Computing Machinery, 44(5):35-40, 2001. [ PDF ]

Gregory Crane, David A. Smith, and Clifford E. Wulfman. Building a hypertextual digital library in the humanities: A case study on London. In Proceedings of the First ACM+IEEE Joint Conference on Digital Libraries, pages 426-434, Roanoke, VA, June 2001. Best paper award. [ PDF ]

Other Publications

John Wilkerson, David A. Smith, and Nick Stramp. The inclusiveness of lawmaking: A text reuse approach to tracing the progress of policy ideas in legislation. In MPSA Annual Meeting. Midwest Political Science Association, April 2014.

John Wilkerson, David A. Smith, and Nick Stramp. Tracing the flow of policy ideas in legislatures: A text reuse approach. In New Directions in Analyzing Text as Data. London School of Economics, September 2013. [ PDF ]

John Wilkerson, David A. Smith, Nick Stramp, and James Dashiell. Tracing the flow of policy ideas in legislatures: A computational approach. In APSA Annual Meeting. American Political Science Association, September 2013.

Ryan Cordell, Elizabeth Maddock Dillon, and David A. Smith. Uncovering reprinting networks in nineteenth-century American newspapers. In Digital Humanities, 2013.

Ryan Cordell and David A. Smith. Uncovering reprinting networks in nineteenth-century American newspapers. In Chicago Colloquium on Digital Humanities & Computer Science, November 2012.

Xiaoye Wu and David A. Smith. Right-branching tree transformation for eager dependency parsing. Technical Report CIIR-776, University of Massachusetts, 2010. [ PDF ]

Jason Naradowsky, Joe Pater, David Smith, and Robert Staubs. Learning hidden metrical structure with a log-linear model of grammar. In Computational Modelling of Sound Pattern Acquisition, pages 59-60, Edmonton, February 2010. Department of Linguistics, University of Alberta.

Joe Pater, David A. Smith, Robert Staubs, Karen Jesney, and Ramgopal Mettu. Learning hidden structure with a log-linear model of grammar. In Linguistic Society of America (LSA), Baltimore, January 2010.

Gregory Druck and David A. Smith. Computing conditional feature covariance under non-projective tree conditional random fields. Technical Report UM-CS-2009-060, University of Massachusetts, 2009.

David A. Smith. Debabelizing libraries: Machine translation by and for digital collections. D-Lib Magazine, 12(3), March 2006. [ HTML ]

Anne Mahoney, Jeffrey A. Rydberg-Cox, David A. Smith, and Clifford E. Wulfman. Generalizing the Perseus XML document manager. In Linguistic Exploration: Workshop on Web-based Language Documentation and Description, Philadelphia, December 2000. [ HTML ]