Assigned: Thu 05 Sep 2006
Due: Thu 26 Oct 2006
Problem 1 (50 points)
Portions courtesy James Allan http://www-ciir.cs.umass.edu/~allan/
A. In this homework problem, you will explore Zipf's and Heaps' Laws.
Go to the Project Gutenberg site (http://www.gutenberg.org) and download
Alice in Wonderland by Lewis Carroll at http://www.gutenberg.org/dirs/etext91/alice30.txt.
(Ignore the header and consider only the text starting at "ALICE'S ADVENTURES
IN WONDERLAND.") Use this Perl script to strip the text of punctuation,
obtaining the original text as a list of words. For example, on a Unix-based
system, you would run a command like

parse.pl alice30.txt > outputfile
Write a program or script that counts words, unique words, and frequencies of
occurrence. For the 25 most frequent words and for the 25 most frequent
additional words that start with the letter "c" (a total of 50 words), print
the word, the number of times it occurs, its rank in the overall list of
words, its probability of occurrence, and the product of the rank and the
probability. Also report the total number of words and the total number of
unique words that you found. Discuss how this text satisfies Zipf's Law (or
why it does not).
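The counting in part A can be sketched in a few lines. This is a minimal illustration, not the required deliverable; the function name `zipf_table` and the in-memory word list are my own choices (in practice you would read the words from the `outputfile` produced by the Perl script):

```python
from collections import Counter

def zipf_table(words, top_n=25):
    """Return (word, count, rank, probability, rank * probability) rows
    for the top_n most frequent words, with rank taken over the whole
    vocabulary sorted by descending frequency."""
    counts = Counter(words)
    total = sum(counts.values())          # total number of words
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        if rank > top_n:
            break
        prob = count / total              # probability of occurrence
        rows.append((word, count, rank, prob, rank * prob))
    return rows
```

The same `Counter` also gives the totals the problem asks for (`sum(counts.values())` words, `len(counts)` unique words), and the 25 "c" words can be obtained by filtering `counts.most_common()` on `word.startswith("c")`.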
B. Suppose that you were building an inverted file and that you omitted all words that occur fewer than five times (i.e., one to four times). According to Zipf’s Law, what proportion of the unique words in the collection would you not index (justify your answer)? What proportion would actually be omitted in the Alice in Wonderland text above?
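The Zipf-based part of B can be sanity-checked numerically. The estimate below rests on an assumption not stated in the problem: the standard corollary of Zipf's Law that the fraction of distinct words occurring exactly n times is approximately 1/(n(n+1)).

```python
# Under Zipf's Law, the fraction of distinct words occurring exactly n
# times is ~ 1 / (n * (n + 1)).  Words omitted from the index occur 1 to
# 4 times, so the omitted fraction telescopes:
#   sum_{n=1}^{4} (1/n - 1/(n+1)) = 1 - 1/5 = 0.8
omitted = sum(1 / (n * (n + 1)) for n in range(1, 5))
print(omitted)  # approximately 0.8, i.e., about 80% of unique words
```

Comparing this 80% prediction against the actual count of words occurring fewer than five times in the Alice text is exactly the second half of part B.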
C. Using the plotting program of your choice, plot (a) frequency vs. rank and
(b) log frequency vs. log rank in order to "visualize" Zipf's Law for this
data set. Furthermore, plot (c) number of unique words vs. total number of
words and (d) log number of unique words vs. log total number of words in
order to "visualize" Heaps' Law for this data set.
D. Using your log-log data, determine the constants in Zipf's and Heaps' Laws
for this data set by performing a line fit using least squares. (If you are
not familiar with the least-squares line-fitting technique, consult
http://mathworld.wolfram.com/LeastSquaresFitting.html for a complete
discussion.) Give the final Zipf's and Heaps' Law equations for this data
set. For the 25 "c" words from part A above, list the actual frequency of
occurrence and the predicted frequency using your derived equations.
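The closed-form least-squares fit from the MathWorld page is short enough to write directly. As a sketch, the snippet below fits synthetic, ideal Zipf data f(r) = C/r (my own test data, not the Alice counts) and recovers the expected slope of -1 on the log-log scale:

```python
import math

def least_squares_fit(xs, ys):
    """Closed-form least-squares line fit: returns (slope, intercept)
    minimizing sum((y - (slope*x + intercept))**2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Ideal Zipf data f(r) = C / r gives log f = log C - 1 * log r, so the
# fitted slope should be -1 and the intercept should be log C.
C = 1000.0
log_r = [math.log(r) for r in range(1, 51)]
log_f = [math.log(C / r) for r in range(1, 51)]
slope, intercept = least_squares_fit(log_r, log_f)
```

For the homework, replace the synthetic lists with your actual (log rank, log frequency) and (log total words, log unique words) data; the fitted slope and intercept then give the Zipf and Heaps constants directly.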
Problem 2 (10 points)

Download the Porter Stemmer (http://www.tartarus.org/~martin/PorterStemmer/)
and play with it! As we discussed, the stemmer makes two types of mistakes:

(a) Give 5 examples of pairs of words that should have the same root but that
the Porter stemmer stems to different roots, e.g., “matrix” and “matrices.”

(b) Give 5 examples of pairs of words that should have different roots but
that the Porter stemmer stems to the same root, e.g., “universe” and
“university.”

You may not use examples from our class lecture or slides, of course. Print
the output of the stemmer on your examples.

Problem 3 (10 points)

Consider a message alphabet of 27 characters comprising the 26 English
letters and the space.

A. Assume that all 27 characters have equal frequencies of occurrence (1/27
each). What is the entropy of a message?

B. Assume now these actual frequencies of occurrence (the numbers are per
1000 letters):

space  A     B     C     D     E      F     G     H     I     J    K    L     M
185.9  64.2  12.7  21.8  31.7  103.1  20.8  15.2  46.7  57.5  0.8  4.9  32.1  19.8

N     O     P     Q    R     S     T     U     V    W     X    Y     Z
57.4  63.2  15.2  0.8  48.4  51.4  76.9  22.8  8.3  17.5  1.3  16.4  0.5

What is the entropy of a message?

C. Can you draw any conclusions from the previous two results?

D. What is the average length of a word? (Hint: Consider the probability of
seeing a space.)

Problem 4 (10 points)

According to Heaps' Law, what proportion of a corpus (a collection of texts)
must be read before 90% of its vocabulary has been encountered? (Assume
beta = 0.5.) Hint: To solve this problem you do not need to know the value
of K.

Problem 5 (10 points)

Term-document weights courtesy Huahai Yang http://www.albany.edu/~hy973732/

Given a query Q and a collection of documents A, B, and C shown in Table 1,
determine the similarity of A, B, and C to Q using language modeling.
Determine the probability of generating the query from the language model
associated with each document using the simple multinomial model and the
following smoothing techniques: (a) no smoothing, i.e., the maximum
likelihood language model; (b) Laplace smoothing; (c) Witten-Bell smoothing.
For the corpus model, use the average of the maximum likelihood models
created in part (a); e.g., the corpus probability of “cat” is the average of
the maximum likelihood probabilities associated with “cat” in documents A,
B, and C. The numbers in the table are the weights between each term and
document.

Table 1: Term-Document Weights

     Cat  Food  Fancy
Q     3    4     1
A     2    1     0
B     1    3     1
C     0    2     2

Problem 6 (10 points)

Courtesy Huahai Yang http://www.albany.edu/~hy973732/

Convert the raw term frequency weights in Table 2 to Okapi tf*idf weights.
Use the formula Okapi tf = tf / (tf + K) with K = 1.5.

Table 2: Raw Term Frequency for a Collection of Documents

     Cape  Whale  Nightmare  Arm  Door
A     0     3      0          0    0
B     1     1      0          1    5
C     3     3      1          1    9
D     0     0      1          9    0
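Problem 3's entropies can be checked numerically. A minimal sketch (the per-1000-letter counts from the table are normalized by their actual sum, which as printed is slightly under 1000):

```python
import math

def entropy(probs):
    """Shannon entropy in bits per character: H = -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Part A: 27 equally likely characters, so H = log2(27) bits.
h_uniform = entropy([1 / 27] * 27)

# Part B: per-1000-letter counts from the table (space first, then A-Z).
counts = [185.9, 64.2, 12.7, 21.8, 31.7, 103.1, 20.8, 15.2, 46.7, 57.5,
          0.8, 4.9, 32.1, 19.8, 57.4, 63.2, 15.2, 0.8, 48.4, 51.4,
          76.9, 22.8, 8.3, 17.5, 1.3, 16.4, 0.5]
total = sum(counts)
h_actual = entropy(c / total for c in counts)
```

Because the actual distribution is far from uniform, `h_actual` comes out below `h_uniform`, which is the qualitative observation part C is after.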
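For Problem 4, the hint holds because K cancels: with V(n) = K * n**beta, asking for the fraction x of an n-word corpus at which V(x*n) = 0.9 * V(n) gives x**beta = 0.9, independent of K. A quick check of that algebra with beta = 0.5:

```python
# Heaps' Law: V(n) = K * n**beta.  Requiring
#   K * (x*n)**beta = 0.9 * K * n**beta
# cancels K and n**beta, leaving x**beta = 0.9, so x = 0.9**(1/beta).
beta = 0.5
x = 0.9 ** (1 / beta)   # fraction of the corpus that must be read
```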
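Problem 5's scoring can be sketched end to end. Two assumptions in this sketch are mine, not the assignment's: the query weights (3, 4, 1) are treated as multinomial exponents, and Witten-Bell is implemented in the common interpolated form with lambda = |D| / (|D| + T), where T is the number of distinct terms in the document; if the course defines Witten-Bell differently, substitute that form.

```python
def multinomial_score(query, doc, smooth):
    """P(Q|D) under the multinomial model: product over query terms of
    P(w|D) ** q_w, where q_w is the term's weight in the query."""
    score = 1.0
    for w, q_w in query.items():
        score *= smooth(w, doc) ** q_w
    return score

query = {"cat": 3, "food": 4, "fancy": 1}           # weights from Table 1
docs = {
    "A": {"cat": 2, "food": 1, "fancy": 0},
    "B": {"cat": 1, "food": 3, "fancy": 1},
    "C": {"cat": 0, "food": 2, "fancy": 2},
}
vocab = ["cat", "food", "fancy"]

def p_mle(w, doc):                                   # (a) no smoothing
    return doc[w] / sum(doc.values())

def p_corpus(w):
    # Corpus model: average of the per-document maximum likelihood models.
    return sum(p_mle(w, d) for d in docs.values()) / len(docs)

def p_laplace(w, doc):                               # (b) Laplace
    return (doc[w] + 1) / (sum(doc.values()) + len(vocab))

def p_witten_bell(w, doc):                           # (c) Witten-Bell
    n = sum(doc.values())                            # document length
    t = sum(1 for v in doc.values() if v > 0)        # distinct terms
    lam = n / (n + t)
    return lam * p_mle(w, doc) + (1 - lam) * p_corpus(w)

for name, doc in docs.items():
    print(name,
          multinomial_score(query, doc, p_mle),
          multinomial_score(query, doc, p_laplace),
          multinomial_score(query, doc, p_witten_bell))
```

Note how the unsmoothed model zeroes out any document missing a query term (document A lacks “fancy”, so its MLE score is 0), which is exactly the problem the smoothing variants address.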
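Problem 6's conversion is mechanical once an idf is chosen. The assignment only fixes the Okapi tf component, so the sketch below assumes idf = ln(N / df_t), with N the number of documents and df_t the document frequency of term t; substitute the course's idf definition (e.g., a different log base) if it differs.

```python
import math

K = 1.5
terms = ["cape", "whale", "nightmare", "arm", "door"]
tf_table = {                       # raw term frequencies from Table 2
    "A": [0, 3, 0, 0, 0],
    "B": [1, 1, 0, 1, 5],
    "C": [3, 3, 1, 1, 9],
    "D": [0, 0, 1, 9, 0],
}
n_docs = len(tf_table)

# Document frequency: number of documents containing each term.
df = [sum(1 for row in tf_table.values() if row[i] > 0)
      for i in range(len(terms))]

def okapi_tf(tf):
    return tf / (tf + K)

# Okapi tf * idf weight for every (document, term) cell.
weights = {
    doc: [okapi_tf(tf) * math.log(n_docs / df[i]) for i, tf in enumerate(row)]
    for doc, row in tf_table.items()
}
```

A zero raw frequency stays zero after conversion (okapi_tf(0) = 0), so only the nonzero cells of Table 2 acquire weights.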