CS G224 Natural Language Processing
Assignment 1
Spring 2006, Prof. Hafner

Due Date:
Part 1, email by noon Tuesday Jan 17
Part 2, print and hand in Wednesday Jan 18

Part 1. For the sample text below (from Harry Potter and the Sorcerer's Stone by J.K. Rowling), do the following by hand:

-- Separate the text into tokens and assign part-of-speech (POS) tags to all tokens using the Penn tagset (Figure 8.6 in your textbook). Be prepared to discuss your choices in class.
-- Send the result by email to hafner@ccs.neu.edu, with subject line "NLP Assignment 1", by Tuesday Jan 17 at noon.
-- The email should be a plain ASCII file (not a MIME, pdf, or Word attachment), and should include your name and email address in the body of the message.
-- A short article written by the developers of the Penn Treebank that you may find helpful (in particular, see Section 2) is provided on the class web site Resources page as a supplemental reading for Week 1.

Sample text:

Everyone from wizarding families talked about Quidditch constantly. Ron had already had a big argument with Dean Thomas, who shared their dormitory, about soccer. Ron couldn't see what was exciting about a game with only one ball where no one was allowed to fly.

Part 2. For the two files containing the first 1000 words and the last 1000 words of the tagged Brown corpus (provided on the class web site Resources page or as Unix files: /course/csg224/.www/resources/brown*.txt), do the following:

-- Compute the frequency (i.e., the count) of each word type appearing in the 2000 words of text. Print this information sorted in descending order of frequency. A word type is defined orthographically (i.e., by its spelling), so one word type can have multiple POS tags. You can use whatever computer tools or languages you wish to do this. Be prepared to explain what you did in class.
-- Answer the questions:
   What percent of the word types have more than one POS tag in the 2000-word sample?
   What percent of the word tokens represent word types with more than one POS tag in the 2000-word sample?
-- Note that the percentages above are calculated using the Brown tagset, not the Penn tagset. Do you think the percentages would be lower or higher if we used the Penn tagset? Give an example to illustrate your conclusion. (Section 2 of the Penn Treebank article may help if you are not sure.)
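
The counting steps in Part 2 can be sketched in Python as follows. This is only one possible approach, not a required one. It assumes each token in the files is written as word/TAG and separated by whitespace, which may not match the class files exactly. It also assumes that spelling is case-sensitive, so "The" and "the" count as different word types. The function names parse_tagged and summarize and the toy sentence are made up for illustration; the sentence is not taken from the Brown files.

```python
from collections import Counter, defaultdict


def parse_tagged(text):
    """Split whitespace-separated word/TAG items into (word, tag) pairs.

    Assumption: the tag follows the LAST slash in each item.
    Items without a slash, or with an empty word or tag, are skipped.
    """
    pairs = []
    for item in text.split():
        word, sep, tag = item.rpartition("/")
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs


def summarize(pairs):
    """Return (ranked word counts, % ambiguous types, % ambiguous tokens)."""
    freq = Counter(word for word, _ in pairs)
    if not freq:
        return [], 0.0, 0.0

    # Collect the set of distinct tags seen with each word type.
    tags = defaultdict(set)
    for word, tag in pairs:
        tags[word].add(tag)

    ambiguous = {word for word, seen in tags.items() if len(seen) > 1}
    type_pct = 100.0 * len(ambiguous) / len(freq)
    token_pct = 100.0 * sum(freq[w] for w in ambiguous) / sum(freq.values())

    # Descending frequency; ties broken alphabetically for a stable listing.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, type_pct, token_pct


if __name__ == "__main__":
    # Toy input (hypothetical, not from the corpus files).
    sample = "The/at old/jj man/nn will/md man/vb the/at boats/nns ./."
    ranked, type_pct, token_pct = summarize(parse_tagged(sample))
    for word, count in ranked:
        print(count, word)
    print(f"types with more than one tag: {type_pct:.1f}%")
    print(f"tokens of such types: {token_pct:.1f}%")
```

To run this on the real data, read both files, concatenate their contents, and pass the result to parse_tagged. If you decide that word types should ignore case, lowercase each word before counting and be ready to explain that choice in class.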