CS G224 Natural Language Processing
Assignment 3
Spring 2006, Prof. Hafner
Due Date: Feb. 1, 2006

In parsing natural language, we normally separate the grammar rules (for which parts of speech serve as the terminal symbols) from the lexicon. This is illustrated in Figures 9.2 and 9.3 in your textbook. After tokenizing (and possibly stemming), we look up the words in the lexicon, ending up with a graph of possibilities.

Your assignment for this week is to design a natural language lexicon program and implement it in an object-oriented language. You are highly encouraged to learn a little Python and do this assignment in Python. However, if you don't have time, programs in Java, C++, or C# will be accepted. Be prepared to discuss your design and your lexicon contents in class.

Your lexicon will have Word objects as its components, with each word having one or more Def objects (i.e., definitions) attached to it. A definition consists of a lexical category (POS) and zero or more features, as shown in the table on page 65 of your text (the + is not necessary). We will use Penn tags for our lexical categories. The diagram below is intended to be suggestive.

    Lexicon ---> Word
                   POS
                     Feature
                     Feature
                   POS
                     Feature
                 Word
                 . . .
                 Word

For example, the word goose shown in the table has three POS values (NN, VB, and VBP); therefore it would have 3 definitions.

The features will be used for finer distinctions. Initially, nouns (NN and NNS) should have the feature MASS or COUNT. The auxiliary verbs (forms of be, do, and have) should have the feature AUX. Verbs should have the feature INTRANS if they can be used without a direct object, and TRANS if they can be used with a direct object. (Note that some verbs, such as "eat", have both.)

Two lookup methods should be provided. With one (string) argument, all definitions should be returned for the word (as a list or tuple).
With two string arguments, the second argument represents a part-of-speech tag, and the word's definition for that part of speech should be returned. For purposes of this assignment, we will make the simplifying assumption that a word has at most one definition in a given lexical category. A method should also exist for printing definitions in a readable form.

Demonstrate your lexicon program by:

1. Creating a lexicon containing the words from the first sentence of the Harry Potter extract we looked at for Assignment 1 (not including punctuation). Include the lexical categories that you think COULD be correct for each word in SOME CONTEXT. For example, "had" can *never* be a VBP (present tense verb) or an MD (modal verb), so those codes should not be included. On the other hand, "shared", which is VBD in the sample text, can also be VBN. Take your best guess regarding the features mentioned above.

2. Running some test cases, invoking the two lookup methods (and printing the results) for a few selected words (at least one with more than one lexical category), and including some failure cases also.

Turn in a printout of your lexicon code, the test program, and its output. This program will be used later as part of a parser exercise.
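
As a starting point for discussion, the Word/Def structure and the two lookup methods described above might look roughly like the following in Python. This is only a minimal sketch, not a required design. The names Def, Word, Lexicon, add, and lookup are assumptions made for illustration, as are the choices to use a single method with an optional second argument (Python has no method overloading) and to signal failure with an empty tuple or None. The features given for each entry are guesses.

```python
class Def:
    """One definition: a Penn POS tag plus zero or more feature strings."""

    def __init__(self, pos, features=()):
        self.pos = pos
        self.features = tuple(features)

    def __str__(self):
        # Readable form: the tag, followed by any features in brackets.
        if self.features:
            return f"{self.pos} [{', '.join(self.features)}]"
        return self.pos


class Word:
    """A word form with at most one Def per lexical category."""

    def __init__(self, text):
        self.text = text
        self.defs = {}  # POS tag -> Def, kept in insertion order

    def add_def(self, pos, features=()):
        self.defs[pos] = Def(pos, features)


class Lexicon:
    def __init__(self):
        self.words = {}  # lower-cased word string -> Word

    def add(self, text, pos, features=()):
        key = text.lower()
        word = self.words.setdefault(key, Word(key))
        word.add_def(pos, features)

    def lookup(self, text, pos=None):
        """One argument: tuple of all Defs (empty if the word is unknown).
        Two arguments: the Def for that tag, or None if there is none."""
        word = self.words.get(text.lower())
        if pos is None:
            return tuple(word.defs.values()) if word else ()
        return word.defs.get(pos) if word else None


lexicon = Lexicon()
# Tags follow the handout's examples; the features are illustrative guesses.
lexicon.add("goose", "NN", ["COUNT"])
lexicon.add("goose", "VB", ["TRANS"])
lexicon.add("goose", "VBP", ["TRANS"])
lexicon.add("shared", "VBD", ["TRANS"])
lexicon.add("shared", "VBN", ["TRANS"])

print([str(d) for d in lexicon.lookup("goose")])  # all definitions
print(lexicon.lookup("shared", "VBN"))            # one definition by tag
print(lexicon.lookup("goose", "MD"))              # failure: no such category
print(lexicon.lookup("unicorn"))                  # failure: unknown word
```

Storing each word's definitions in a dictionary keyed by tag builds in the "at most one definition per lexical category" assumption. Adding a second definition with the same tag simply replaces the first.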