CS G224 - Natural Language Processing Spring 2006 - Prof. Hafner Assignment 6 - Due March 3 You are to implement a chart parser that makes use of the the lexicon you implemented for Assignment 5. (Therefore, those who did not create a "lookup" function will need to do so.) The top level of your parser should be a read/parse/print-trees loop that allows the user to enter sentences and see each of the maximal parses, as described below. You can require that the input words be put into a list if that makes your programming job easier. You need not handle punctuation. Below you will find a context free grammar G that should work for most of the Harry Potter sentences given in Assignment 5. Conjunctions could not be included because they give rise to left recursion. You will also find advice on programming the chart parser, which, combined with the course readings should make this assignment reasonably straightforward. To demonstrate your parser, run the following selected sentences, plus two failure inputs: -- a sentence containing a word not in the dictionary -- a sentence whose words are known but which does not have a parse Snakes bite He looked at the grandfather clock They took their usual places in Lockhart's classroom Harry noticed that they were straining to escape the straps holding them inside the box The "maximal parses" are all the maximal length edges starting at Node 0, regardless of their grammatical category. ----------------------------------------------------------------- Grammar G: Terminal Symbols: Noun, ProperNoun, PersonalPronoun, PossessivePronoun, RelPro, Determiner, Adverb, Adjective, Number, Aux, Modal, Verb, Preposition, Conjunction, 's Rules: (non-terminals can be inferred) Note: Parentheses () mean a choice Angle brackets <> mean an optional constituent Both of these must be expanded into multiple rules S --> NP VP NP --> PersonalPronoun NP --> ProperNoun NP --> Nominal // expands into 4 rules DTP --> Determiner DTP --> PossessivePronoun DTP --> ProperNoun 's DTP --> Determiner Nominal 's /// A general rule gives rise to /// left recursion, so can't /// be used in top down parsing AP --> Adjective AP --> Adverb Adjective AP --> Number AP Nominal --> Noun Nominal --> Noun Nominal Nominal --> Noun PP Nominal --> Noun VP Nominal --> Noun RelClause RelClause --> RelPro VP // Relative Pronouns: who, that PP = Preposition NP VP --> VBG // VBG means "verb group" VP --> VBG TO VP // TO = the word to VP --> VBG THAT S // THAT = the word that VBG --> Verb VBG --> Modal Verb VBG --> Aux Verb /// Note that rules like NP --> NP Conjunction NP are not allowed /// in top down parsing ------------------------------------------------------------------ Programming a chart parser for top-down parsing I. Data structures: Grammar -- a set of Rule objects. A lookup-LHS function that returns a subset of the rules with the given LHS. Chart -- a sequence of nodes. Each node consists of two sets of edges: originating and terminating (at this node). Edge -- a structure consisting of: originating node, terminating node, label. The label consists of a category symbol (usually called CAT), and some other elements. For incomplete edges, the rest of the label consists of a sequence of remaining unparsed RHS symbols, followed by a sequence of already- parsed RHS symbols. An already-parsed RHS symbol combines its CAT with a pointer to the complete edge that satisfies the CAT. (in other words, each already-parsed symbol is the head of a sub-tree). For complete edges, the sequence of remaining RHS symbols is empty. For lexical edges (which are a variety of complete edges), the rest of the label is just the original word that appeared in the input (usually a string). II. Applying the fundamental rule: When an incomplete edge E is added to the chart terminating at node i: for each complete edge originating at node i: if its CAT matches the first remaining RHS symbol of E, then create a 1-node extension of E and put it on the unprocessed stack end for When a complete edge E is added to the chart originating at node j: for each incomplete edge terminating at node j: if its first remaining RHS symbol matches CAT(E), then create a 1- node extension of the incomplete edge and put it on the unprocessed stack. end for III. Overall algorithm -- Top down, depth first, left to right parsing: 1. For each word in the input: Lookup the word's definitions, and for each definition, create a lexical edge and add it to the chart. (The fundamental rule will not apply since there are no incomplete edges yet.) 2. lookup all grammar rules with LHS = S, create a "self-loop" edge for each one, originating and terminating at node 0, and put them on the unprocessed stack. 3. While the unprocessed stack is not empty: Remove the first edge E from the unprocessed stack. If E is a complete edge, add it to the chart. If E is an incomplete edge, before adding it to the chart check to see if there is an edge (complete or incomplete) originating at E's terminating node with a CAT that matches the first unparsed RHS symbol of E. If not, create a self-loop edge for each grammar rule whose LHS matches the first unparsed RHS symbol of E, and put them on the unprocessed stack. 4. Find the maximal length edges that originate at node 0 and display them in bracketed tree format.