CS6120: Natural Language Processing

Homework 2


Assigned: Tuesday, 27 February 2013
Due: 11:59pm, Friday, 14 March 2013


  1. If you collaborated with others, you must write down with whom you worked on the assignment. If this changes from problem to problem, then you should write down this information separately with each problem.
  2. Email the instructor the requested written answers, code, and instructions on how to (compile and) run the code.


  1. Generating sentences from a grammar [10 points]

    First, download the initial grammar: grammar1. The file contains self-explanatory comments, delimited by #. The basic format is:

    	1 VP	Verb NP
    Ignore the leading "1" for now. The first symbol is the left-hand side of a context-free rewrite rule; the remaining one or more symbols are right-hand-side terminals and non-terminals. There is no typographic distinction enforced between terminals and non-terminals; rather, any symbol that does not appear on a left-hand side is by definition a terminal. You may add new nonterminals, but don't add any new terminals. We'll explain why below.
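    For concreteness, reading this format might look like the following Python sketch; the function name `load_grammar` and the exact data layout are assumptions, not a required design:

```python
from collections import defaultdict

def load_grammar(path):
    """Parse a grammar file of 'weight LHS RHS...' lines.
    Text after '#' is a comment; blank lines are skipped.
    Returns a dict: LHS -> list of (weight, rhs_tuple)."""
    rules = defaultdict(list)
    with open(path) as f:
        for line in f:
            line = line.split('#', 1)[0].strip()  # drop comments
            if not line:
                continue
            fields = line.split()
            weight = float(fields[0])
            lhs, rhs = fields[1], tuple(fields[2:])
            rules[lhs].append((weight, rhs))
    return rules
```

    Any symbol that never appears as a key in the resulting dict is, by the definition above, a terminal.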

    Now let's get down to work:

    Write a random sentence generator in the language of your choice. When handing it in, make sure we can run it like so:

    	  ./generate grammar1 5
    to print five random sentences from the grammar grammar1 to standard output. You should see output like:
    	  the king rides every horse .

    Your program should start with the START symbol and recursively choose rewrite rules until termination. Each rule is preceded by a non-negative real number. In grammar1, there are two possible expansions of NP with the weights 20 and 1. This means you should pick the first option with probability 20/(20 + 1). Don't hardwire this grammar into your program. Read it anew each time it runs, so that you can modify the grammar later.
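    The recursive expansion can be sketched in Python as follows. This is a minimal illustration, assuming the grammar has already been read into a dict mapping each left-hand side to its (weight, right-hand side) alternatives; that layout is an assumption, not a requirement:

```python
import random

def generate(symbol, rules):
    """Recursively expand `symbol` into a list of terminal words.
    `rules` maps LHS -> list of (weight, rhs_tuple); any symbol
    with no rules is by definition a terminal."""
    if symbol not in rules:
        return [symbol]
    options = rules[symbol]
    # Pick one rule with probability proportional to its weight.
    _, rhs = random.choices(options, weights=[w for w, _ in options], k=1)[0]
    words = []
    for s in rhs:
        words.extend(generate(s, rules))
    return words
```

    A driver script would then call `generate('START', rules)` once per requested sentence and join the words with spaces.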

    To see what happens with different weights, copy grammar1 to grammar2 and modify it. For instance, you can assert that the is a more common determiner than a or another like so:

    	      1 Det a
    	      1 Det another
    	      4 Det the
    	      1 Det this
    In particular, the would now have probability 4/7, while each of the other determiners would have probability 1/7 (each weight divided by the total weight 1 + 1 + 4 + 1 = 7).
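    The normalization is simply each weight divided by the sum of the weights sharing that left-hand side:

```python
# Weights for the four Det rules shown above.
weights = {'a': 1, 'another': 1, 'the': 4, 'this': 1}
total = sum(weights.values())                      # 1 + 1 + 4 + 1 = 7
probs = {w: c / total for w, c in weights.items()} # 'the' -> 4/7, rest -> 1/7
```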

    Play around with the weights in grammar2 and see how your generated sentences change. Can you make the average sentence length much longer or shorter?

    Hand in the probabilized generation program, your modified grammar2, and ten random sentences.

  2. Describing English with context-free rules [20 points]

    1. Copy grammar2 to grammar3. Modify the latter so that, in addition to the strings of grammar2, it can also generate phenomena like:
      	  do coconuts speak ?
      	  who does Arthur suggest she carry ?
      	  are they suggesting Arthur ride to Camelot ?
      	  the Holy Grail was covered by a yellow fruit .
      	  do not speak !
      	  Arthur will have been riding for eight nights .
      	  Arthur and Guinevere migrate frequently .
      	  he knows what they are covering with that story .
      	  the king drank to the castle that was his home .
      Of course, you don't want to generate just the nine sentences above, but others with the same constructions.
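      As one illustrative starting point for the first example, a yes-no question can be produced by fronting an auxiliary. The nonterminal names below (YesNoQ, VerbStem) are invented for this sketch, not part of the provided grammar, and a real solution would generalize VerbStem to full uninflected verb phrases:

      	  1 S1 YesNoQ
      	  1 YesNoQ do NP VerbStem ?
      	  1 YesNoQ does NP VerbStem ?
      	  1 VerbStem speak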

      Hand in your expanded grammar3.

    2. Create a final grammar, grammar4. Modify it to model some new constructions of English that you find interesting. Explain in comments to your grammar file what you are doing. Hand in grammar4.

  3. Just to be safe, use a backoff bigram model [5 points]

    Even after working on your grammar for a while, you won't be able to generate all grammatical sentences of English. While it might be acceptable for some applications to restrict their output to a subset of English, you often want to handle a much wider range of input. We've therefore included a second subgrammar in the file, headed by the nonterminal S2. This subgrammar is simply a tag bigram model. Copy grammar4 to grammar5, and set S2's weight to something greater than its initial value of 0. You might also tinker with the transitions among the different tags. Unless you set S2's weight overly high, you shouldn't see too many word-salad sentences. (For fun, though, you could reverse the mixing proportions of the two grammars and see what happens.)
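    A tag bigram subgrammar in this rule format might be sketched as follows; the tag nonterminals (T_Noun, T_Verb) and weights here are invented for illustration and are not the actual contents of the provided file:

    	  1 S2 T_Noun
    	  5 T_Noun Noun T_Verb
    	  1 T_Noun Noun
    	  5 T_Verb Verb T_Noun
    	  1 T_Verb Verb

    Each T_tag nonterminal emits one word of that tag and then either transitions to a next tag or stops, so the tag sequence forms a Markov chain.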

  4. Addendum: Competitive grammar writing

    Besides the instructor and TA looking over your shoulder, what's your incentive to come up with interesting constructions in the question above, or a good balance between S1 and S2? The answer is simple and trendy: gamification! Using the final grammar5 you hand in, we'll generate some sentences from the S1 subgrammar and judge their grammaticality. We'll then try to parse your grammatical output using other students' grammars. The winning grammar will be the one that assigns the highest average log likelihood to the set of grammatical sentences. A nominal number of extra credit points—and glory!—will be granted.