CS G224 Natural Language Processing
		         Assignment 1

Spring 2006, Prof. Hafner
Due Date:  Part I,  email by noon Tuesday Jan 17
	   Part II, print and hand in Wednesday Jan 18


Part 1.  For the sample text below (from Harry Potter and the Sorcerer's
Stone by J.K. Rowling), do the following by hand:

   -- separate the text into tokens and assign part-of-speech (POS) tags 
   to all tokens using the Penn tagset (Figure 8.6 in your textbook).
   Be prepared to discuss your choices in class.

   -- send the result by email to hafner@ccs.neu.edu, with subject line:
   "NLP Assignment 1" by Tuesday Jan 17 at noon.

   -- the email should be plain ASCII text (not a MIME, PDF, or
   Word attachment), and should include your name and email address in 
   the body of the message.  

   -- a short article written by the developers of the Penn Treebank that
   you may find helpful (in particular, see Section 2) is provided on
   the class web site Resources page as a supplemental reading for Week 1.

Sample text:
    Everyone from wizarding families talked about Quidditch constantly.
    Ron had already had a big argument with Dean Thomas, who shared their
    dormitory, about soccer.  Ron couldn't see what was exciting about
    a game with only one ball where no one was allowed to fly.
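
The tokenization itself is to be done by hand, but a rough automatic
pass can serve as a cross-check. The sketch below is only an
illustration, not a reference tokenizer: the function name tokenize is
hypothetical, and it assumes the Penn Treebank convention of splitting
the contraction "n't" off as its own token and treating each
punctuation mark as a separate token. It assigns no POS tags.

```python
import re

def tokenize(text):
    """Rough Penn-style tokenizer (illustrative sketch only)."""
    # Assumption: "n't" is split from its host word, e.g. couldn't -> could n't.
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Tokens are "n't", runs of word characters, or single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", text)

print(tokenize("Ron couldn't see what was exciting about a game."))
```

Where your hand tokenization differs from this output, be prepared to
explain which choice you made and why.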

Part 2.  For the two files containing the first 1000 words and the last
1000 words of the tagged Brown corpus (provided on the class web site
Resources page or as Unix files: /course/csg224/.www/resources/brown*.txt)
do the following:

   -- Compute the frequency (i.e., the count) for each word type appearing 
   in the 2000 words of text. Print this information sorted in descending 
   order of frequency. A word type is defined orthographically (i.e., by 
   its spelling), so one word type can have multiple POS tags.  You can
   use whatever computer tools or languages you wish to do this.  Be 
   prepared to explain what you did in class.

   -- Answer the questions: What percent of the word types have more
   than one POS tag in the 2000-word sample?  What percent of the
   word tokens represent word types with more than one POS tag in the 
   2000-word sample?

   -- Note that the percentages above are calculated using the Brown
   tagset, not the Penn tagset. Do you think the percentages would be
   lower or higher if we used the Penn tagset?  Give an example to
   illustrate your conclusion. (Section 2 of the Penn Treebank article
   may help if you are not sure.)
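
The counting steps in Part 2 could be sketched as follows. This is one
possible approach, not a required one. It assumes each token in the
files is written as word/TAG and that tokens are separated by
whitespace; check the actual files before relying on that. The names
read_tagged and summarize are hypothetical, the sample string is
invented for illustration, and word types are compared case-sensitively
(an assumption you may want to revisit).

```python
from collections import Counter, defaultdict

def read_tagged(text):
    """Yield (word, tag) pairs from whitespace-separated word/TAG items."""
    for item in text.split():
        # Split on the LAST slash, since a word may itself contain "/".
        word, sep, tag = item.rpartition("/")
        if sep and word:  # skip items that do not look like word/TAG
            yield word, tag

def summarize(pairs):
    """Return (counts, percent of ambiguous types, percent of tokens
    belonging to ambiguous types)."""
    counts = Counter()        # word type -> number of tokens
    tags = defaultdict(set)   # word type -> set of tags seen
    for word, tag in pairs:
        counts[word] += 1
        tags[word].add(tag)
    ambiguous = {w for w, seen in tags.items() if len(seen) > 1}
    pct_types = 100.0 * len(ambiguous) / len(counts)
    pct_tokens = 100.0 * sum(counts[w] for w in ambiguous) / sum(counts.values())
    return counts, pct_types, pct_tokens

# Invented sample, NOT taken from the Brown corpus files.
sample = "the/at run/nn was/bedz long/jj ./. they/ppss run/vb the/at race/nn ./."
counts, pct_types, pct_tokens = summarize(read_tagged(sample))

# Word types in descending order of frequency.
for word, n in counts.most_common():
    print(n, word)
print("ambiguous types: %.1f%%, tokens of ambiguous types: %.1f%%"
      % (pct_types, pct_tokens))
```

To run this on the real data, read the two files and pass their
combined contents to read_tagged in place of the sample string. Note
that summarize divides by the number of types and tokens, so it expects
non-empty input.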