CS6120: Natural Language Processing

DRAFT for Spring 2021

This is a graduate course on natural language processing. This term, there is a separate undergraduate section (CS4120).

Instructor: David Smith, Associate Professor in Khoury College of Computer Sciences (Office Hours: TBA, or by appointment)

Teaching assistants:

Goals

As with many graduate courses, there are at least two goals: to familiarize yourself with the problems, methods, and literature of a field and to train yourself in carrying out research in that field. As in any cutting-edge area of computer science, you should think of research as the process of developing and evaluating new methods. Whether or not you are an “NLP researcher”, you should find these skills useful when solving problems in business, in other areas of science, or in public service. This course will cover facts about human language and about textual documents created by humans; statistical and computational methods used to draw inferences from these data; and software and methodological tools used in NLP research.

Organization

This term, the course will adopt a flipped structure. Every student will be assigned to one discussion section. These discussion sections will consist of eight or nine students, who will meet every week with the instructor and additionally with a teaching assistant. The team for your course project will consist of one to four students from the same section. If you already know and like working with other students in the class, we will endeavor to arrange things so that you share a section.

Each week, readings and prerecorded lectures will cover certain topics. Roughly every other week, there will be a quiz to test your understanding. (You get points simply for attempting answers on these quizzes. See below for grading policies.) We will then be able to discuss the lectures, readings, quizzes, and programming assignments during section.

Discussion sections will also be the venue for brainstorming, developing, and presenting your course projects. Rather than an avalanche of presentations at the end of the course, this sustained discussion will hopefully offer more meaningful opportunities for feedback.

As usual, the instructor will hold drop-in office hours for students to talk one-on-one.

This course structure will, we hope, provide more opportunities for discussion and feedback than the lecture format this course took in previous terms. Some students felt comfortable asking questions and making comments in lectures, but many did not. The section structure should also allow us to better accommodate students who are unable to attend class in person or who find themselves in inconvenient time zones, without some of the logistical difficulties of a large online class. That said, please reach out to the professor or a teaching assistant if you are having difficulties or have suggestions about how things could go better. All section meetings will start off remote. As we assess the situation and as the registrar assigns us appropriate space, they may meet in person, at least in part.

Source Materials: Textbooks and Lectures

There are no required textbooks for this class; rather, for each topic covered, we will provide suggested readings, lectures, and notes from four partially overlapping sets of source materials:

  1. Audio recordings of lectures with slides, as well as plain PDF slides, will be posted for each topic.
  2. The order of topics will generally follow Jacob Eisenstein's Introduction to Natural Language Processing (MIT Press, 2019), abbreviated E in the syllabus. You may buy it from the publisher or your favorite online bookseller, or find the author's manuscript online. Suggested readings will be keyed to section numbers that should be consistent across print, online, and PDF versions.
  3. In parallel, we will suggest readings from the third edition of Jurafsky and Martin's Speech and Language Processing, abbreviated JM3 in the syllabus. Readings will be keyed to section numbers.
  4. For some topics, we will also link to open-access research papers.

Although it can often be an effective strategy to learn about the same topic from several perspectives, you do not have to read all source materials for all topics. But which should you choose? Many students who are new to language data and NLP find Jurafsky and Martin more approachable, since they provide a good introduction to linguistic phenomena and detailed explanations of algorithms. If you have some background in linguistics or machine learning, Eisenstein's more formal approach and his coverage of the research literature might be more helpful. Some people simply have preferences about style and format, such as preferring reading to listening to lectures. You should be able to profit from the course whichever set of sources you prefer to use.

Coursework and Evaluation

Quizzes

To help consolidate your understanding of the material, there will be quizzes approximately every other week. Each quiz will be worth only a small number of points, most of which you can earn simply by attempting the questions. Together, the quizzes make up 10% of the course grade.

Homework

There will be five homework assignments, each worth 10% of the course grade. They will include programming exercises to implement practical solutions for NLP tasks such as text classification, sequence tagging, translation, etc., and questions related to your implementation. The programming language used for these exercises is Python. You will submit your assignments with GitHub Classroom, so please create a GitHub account if you don't have one already.

Course project

You will complete a course project, on your own or as part of a group of two to four students from your section. Overall, the project will make up 40% of the course grade. We will announce specific milestones as the course progresses. Roughly, by the beginning of October, you will need to have formed your project teams and settled on a project topic with the instructor. By the beginning of November, your team should have written a plan for experiments and have collected data or have a realistic plan for data collection. Your team will give a presentation on your project in the last week of classes, and your final report will be due during exam week.

What makes a good project for this course? At a high level, you should work on a project that teaches you something you would like to learn, either about NLP methods or about other topics that NLP can illuminate, while still being achievable within the bounds of a single semester. Two broad categories of projects that often succeed are: replicating an existing paper that you find interesting, when code and data are available, and then designing and implementing an extension to that model; and using existing NLP methods and applying them to a different problem and dataset of interest to you. There are of course other possibilities that you and your teams can discuss when settling on a topic.
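Taken together, the weights stated above (quizzes 10%, five homework assignments at 10% each, and the course project 40%) account for the whole course grade. The short Python sketch below shows one way these components might be combined. It is only an illustration: the 0-100 scoring scale, the simple weighted average, and the names WEIGHTS and course_grade are assumptions, not the official grading procedure.

```python
# Weights as stated in this syllabus. The 0-100 scale and the simple
# weighted average are assumptions made for illustration only.
WEIGHTS = {"quizzes": 0.10, "homework": 0.50, "project": 0.40}


def course_grade(quizzes, homeworks, project):
    """Combine component scores, each on an assumed 0-100 scale."""
    if len(homeworks) != 5:
        raise ValueError("expected five homework scores")
    # Five assignments at 10% each is the same as 50% of their average.
    homework_avg = sum(homeworks) / len(homeworks)
    return (WEIGHTS["quizzes"] * quizzes
            + WEIGHTS["homework"] * homework_avg
            + WEIGHTS["project"] * project)


# The three weights should account for the whole grade.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

# Hypothetical scores, invented for this example.
print(course_grade(quizzes=100, homeworks=[90, 80, 85, 95, 100], project=88))
```

Because the homework assignments are weighted equally, averaging them and applying a single 50% weight is equivalent to weighting each one at 10%.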

Late policy

Assignments are due at the announced due date and time, usually 11:59 p.m. Eastern Time. For one assignment only, you may submit up to four days late without needing to ask the professor. After that, if you need more time, you should ask the professor for an extension before the assignment is due.


Syllabus

This schedule is subject to change. Check back as the class progresses or consult the lecture notes on Piazza.

  1. Why NLP?
  2. Text classification
  3. Language modeling
  4. Sequence labeling
  5. Syntactic and semantic trees and graphs
  6. Compositional semantics as logical inference
  7. Distributional semantics and embeddings
  8. Information extraction: entities and relations
  9. Machine translation

Course Policies

Academic Integrity

For quizzes, programming assignments, and most other coursework, all work submitted for credit must be your own. The only exception is for the course project. If you work as a group of two to four students in total, we regard your submission as the joint responsibility of the whole team. Also, as in any research project, you may build on, and therefore cite, the ideas, papers, data, and code of others.

Accommodations for Students with Disabilities

If you have a disability-related need for reasonable academic accommodations in this course and have not yet met with a Disability Specialist, please visit www.northeastern.edu/drc and follow the outlined procedure to request services.

If the Disability Resource Center has formally approved you for an academic accommodation in this class, please present the instructor with your “Professor Notification Letter” during the first two weeks of the semester, so that we can address your specific needs as early as possible. You should also feel free to drop by the instructor's office hours to discuss your concerns about the course.