IS U900/CS G224 Natural Language Processing
		         Assignment 1

Spring 2006, Prof. Hafner
Due Date: Tuesday, January 16 

Part 1.  Type the sample text below (from Harry Potter and the Sorcerer's
Stone by J.K. Rowling) into a text file, keeping the line breaks exactly
as shown.

a. Write a Python program to tokenize the text and create an output
file with one word token per line (in the order they appear in the text).
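
One possible starting point for part (a) is sketched below. It is not a
required design. It assumes a word token is a run of letters that may
contain internal apostrophes (so "couldn't" is one token) and that
punctuation is discarded; the names TOKEN_RE, tokenize and write_tokens
are made up for this illustration.

```python
import re

# Assumption: a token is a run of letters, optionally with internal
# apostrophes. Digits and punctuation are not treated as word tokens.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def tokenize(text):
    """Return the word tokens of text in the order they appear."""
    return TOKEN_RE.findall(text)

def write_tokens(in_path, out_path):
    """Read in_path, tokenize it, and write one token per line to out_path."""
    with open(in_path, encoding="utf-8") as f:
        tokens = tokenize(f.read())
    with open(out_path, "w", encoding="utf-8") as f:
        for tok in tokens:
            f.write(tok + "\n")
    return tokens

# Small demonstration on a string rather than a file.
print(tokenize("Ron couldn't see what was exciting about\na game"))
```

Other token definitions (for example, keeping punctuation marks as
separate tokens) are equally defensible; what matters is stating the
choice and applying it consistently.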

b. Write a Python program to tokenize the text and perform frequency
counts. Your program should create an output file with one word type
per line (in alphabetical order), along with that word's absolute
frequency (its count) and relative frequency (its count divided by the
total number of tokens).
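
A minimal sketch of the counting step for part (b) follows. It assumes
the same token definition as a letters-and-apostrophes regular
expression, and it assumes that word types are compared after
lowercasing (so "Ron" and "ron" count as one type); both are choices you
may make differently. The names frequency_table and format_table are
hypothetical.

```python
import re
from collections import Counter

# Assumption: same token definition as in part (a) of this sketch.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def frequency_table(text):
    """Return (type, absolute count, relative frequency) rows, sorted by type."""
    # Assumption: case is folded, so types are compared in lowercase.
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    counts = Counter(tokens)
    total = len(tokens)
    # sorted(counts) is empty when there are no tokens, so no division by zero.
    return [(w, counts[w], counts[w] / total) for w in sorted(counts)]

def format_table(rows):
    """Format rows as tab-separated lines, one word type per line."""
    return "\n".join(f"{w}\t{n}\t{r:.4f}" for w, n, r in rows)

print(format_table(frequency_table("Ron had already had a big argument")))
```

Writing the formatted string to a file with open(path, "w") completes
the program.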

Hand in a printout of your two programs and their output.

Sample text:
    Everyone from wizarding families talked about Quidditch constantly.
    Ron had already had a big argument with Dean Thomas, who shared their
    dormitory, about soccer.  Ron couldn't see what was exciting about
    a game with only one ball where no one was allowed to fly.

Part 2. Consider the various ways specific amounts of U.S. money can be 
described in English text.  (This does not include indirect references
such as "the price of a Big Mac".)

a. Try to write one or more regular expressions that (taken as a group) 
will match such descriptions, and will NOT match anything else.  

Note that this is a hard problem that is impossible to solve
perfectly. (Why?)  So make up a few REs, and be prepared to
discuss the pros and cons of your solution.

b. Write a Python program that uses the regular expression package
(re) to implement your solution computationally, and demonstrate it on
some data that you make up.
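
To show the mechanics of the re package for Part 2, here is a deliberately
incomplete sketch. It assumes only two surface forms: a dollar sign
followed by digits (with optional comma grouping, decimals, and a scale
word), and digits followed by a unit word. Spelled-out amounts such as
"five dollars" are not covered. MONEY_RE and find_money are names
invented for this example.

```python
import re

# Assumption: only two forms are covered. Whitespace inside the pattern
# is ignored because of re.VERBOSE, so real spaces are written as \s.
MONEY_RE = re.compile(r"""
    \$\d+(?:,\d{3})*(?:\.\d+)?                       # $12.99, $18,500
    (?:\s(?:thousand|million|billion|trillion))?     # $3 million
  |
    \b\d+(?:,\d{3})*(?:\.\d+)?\s                     # 75, 1,200, 2.5
    (?:(?:thousand|million|billion|trillion)\s)?     # optional scale word
    (?:dollars?|cents?|bucks?)\b                     # unit word
""", re.VERBOSE | re.IGNORECASE)

def find_money(text):
    """Return all substrings of text matched by MONEY_RE, left to right."""
    return MONEY_RE.findall(text)

data = "The book costs $12.99, the car $18,500, and lunch was 75 cents."
print(find_money(data))
```

Because every group is non-capturing, findall returns the full matched
strings. Testing on made-up sentences that should not match (for
example, "room 404") is as informative as testing on ones that should.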