IS U900/CS G224 Natural Language Processing
Assignment 1
Spring 2006, Prof. Hafner
Due Date: Tuesday, January 16

Part 1. Type the sample text below (from Harry Potter and the Sorcerer's Stone by J.K. Rowling) into a text file with the line breaks as shown below.

a. Write a Python program to tokenize the text and create an output file with one word token per line (in the order the tokens appear in the text).

b. Write a Python program to tokenize the text and perform frequency counts. Your program should create an output file with one word type per line (in alphabetical order), along with the word's absolute and relative frequency.

Hand in a printout of your two programs and their output.

Sample text:

Everyone from wizarding families talked about Quidditch constantly. Ron had already had a big argument with Dean Thomas, who shared their dormitory, about soccer. Ron couldn't see what was exciting about a game with only one ball where no one was allowed to fly.

Part 2. Consider the various ways specific amounts of U.S. money can be described in English text. (This does not include indirect references such as "the price of a Big Mac".)

a. Try to write one or more regular expressions that (taken as a group) will match such descriptions, and will NOT match anything else. Note that this is a really hard problem that is impossible to solve perfectly. (Why?) So make up a few REs, and be prepared to discuss the pros and cons of your solution.

b. Write a Python program that uses the regular expression package (re) to implement your solution computationally, and demonstrate it on some data that you make up.
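
As a starting point for Part 2b, the sketch below shows one possible way to combine a few regular expressions with Python's re package. It is a minimal illustration, not a reference solution: the pattern names (NUMBER, SCALE, UNIT, NUMBER_WORD, MONEY_RE), the helper find_money, and the demonstration sentence are all made up for this example, and the patterns cover only three surface forms (dollar-sign amounts, digits followed by a unit word, and spelled-out numbers followed by a unit word).

```python
import re

# Illustrative, deliberately incomplete patterns. All names are hypothetical.
NUMBER = r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?"   # 5, 1250, 1,250.00, 3.5
SCALE = r"(?:thousand|million|billion|trillion)"
UNIT = r"(?:dollars?|cents?|bucks?)"
NUMBER_WORD = (
    r"(?:a|an|one|two|three|four|five|six|seven|eight|nine|ten|"
    r"eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|"
    r"eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|"
    r"eighty|ninety|hundred|thousand|million|billion)"
)

# Form 1: dollar sign plus digits, optionally a scale word ("$3.5 million").
DOLLAR_SIGN = rf"\${NUMBER}(?:\s{SCALE}\b)?"
# Form 2: digits, optional scale word, then a unit word ("75 cents").
DIGITS_UNIT = rf"\b{NUMBER}(?:\s{SCALE})?\s{UNIT}\b"
# Form 3: one or more number words, then a unit word ("twenty-five dollars").
WORDS_UNIT = rf"\b{NUMBER_WORD}(?:[\s-]{NUMBER_WORD})*\s{UNIT}\b"

MONEY_RE = re.compile(
    "|".join([DOLLAR_SIGN, DIGITS_UNIT, WORDS_UNIT]), re.IGNORECASE
)


def find_money(text):
    """Return every substring of text matched by MONEY_RE, in order."""
    return [m.group(0) for m in MONEY_RE.finditer(text)]


if __name__ == "__main__":
    # Made-up demonstration data.
    demo = (
        "The ticket cost $1,250.00, lunch was twenty-five dollars, "
        "and I found 75 cents plus a dollar. "
        "The firm raised $3.5 million in 2006."
    )
    for match in find_money(demo):
        print(match)
```

On the demonstration sentence this finds the five amounts and skips the bare year 2006. The sketch also shows why the problem is hard: it would wrongly match "a dollar" inside "a dollar sign", and it misses informal forms such as "five grand". Trade-offs of this kind are what Part 2a asks you to discuss.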