CSG339 06F: Homework 02

Created: Thu 05 Oct 2006
Last modified: 

Assigned: Thu 05 Oct 2006
Due: Thu 26 Oct 2006


Problem 1 (50 points)
Portions courtesy James Allan
http://www-ciir.cs.umass.edu/~allan/ 

A. In this homework problem, you will explore Zipf's and Heaps' Laws.

Go to the Project Gutenberg site (http://www.gutenberg.org) and download Alice in Wonderland by Lewis Carroll from http://www.gutenberg.org/dirs/etext91/alice30.txt. (Ignore the header and consider only the text starting at "ALICE'S ADVENTURES IN WONDERLAND.") Use the provided perl script (parse.pl) to strip the text of punctuation, obtaining the original text as a list of words. For example, on a Unix-based system you would run a command like

parse.pl alice30.txt > outputfile

Write a program or script that counts words, unique words, and frequency of occurrence. For the most frequent 25 words and for the most frequent 25 additional words that start with the letter c (a total of 50 words), print the word, the number of times it occurs, its rank in the overall list of words, the probability of occurrence, and the product of the rank and the probability. Also indicate the total number of words and the total number of unique words that you found. Discuss how this text satisfies Zipf's Law (or why it does not).
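For example, a short Python sketch along the following lines handles the counting (any language is fine; "outputfile" is the parse.pl output from above). Trim its listing down to the 50 rows the problem asks for.

from collections import Counter

# Read the punctuation-stripped output of parse.pl (saved as "outputfile"
# above) and split it into lowercase words.
words = open("outputfile").read().lower().split()
counts = Counter(words)
total = len(words)

print("total words:", total)
print("unique words:", len(counts))

# Words in decreasing order of frequency; rank 1 is the most frequent word.
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    prob = freq / total
    print(rank, word, freq, round(prob, 5), round(rank * prob, 5))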

B. Suppose that you were building an inverted file and that you omitted all words that occur fewer than five times (i.e., one to four times). According to Zipf’s Law, what proportion of the unique words in the collection would you not index (justify your answer)? What proportion would actually be omitted in the Alice in Wonderland text above?

C. Using the plotting program of your choice, plot (a) frequency vs. rank and (b) log frequency vs. log rank in order to "visualize" Zipf's Law for this data set. Furthermore, plot (c) number of unique words vs. total number of words and (d) log number of unique words vs. log total number of words in order to "visualize" Heaps' Law for this data set.
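For instance, the log-log Zipf plot takes only a few lines of matplotlib; the Heaps' Law plots need the running totals of words and unique words collected while reading through the text.

from collections import Counter
import matplotlib.pyplot as plt

# Frequencies from the parse.pl output ("outputfile"), sorted so that
# rank 1 is the most frequent word.
freqs = sorted(Counter(open("outputfile").read().lower().split()).values(),
               reverse=True)
ranks = range(1, len(freqs) + 1)

plt.loglog(ranks, freqs, marker=".")   # plot (b); use plt.plot for plot (a)
plt.xlabel("log rank")
plt.ylabel("log frequency")
plt.show()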

D. Using your log-log data, determine the constants in Zipf's and Heaps' Laws for this data set by performing a line fit using least squares. (If you are not familiar with the least squares line fitting technique, consult http://mathworld.wolfram.com/LeastSquaresFitting.html for a complete discussion.) Give the final Zipf's and Heaps' Law equations for this data set. For the 25 "c" words from part A above, list the actual frequency of occurrence and the predicted frequency using your derived equation.
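In Python, numpy's polyfit performs the least-squares line fit directly on the log-log data. A sketch for the Zipf fit follows; the Heaps fit works the same way on the (log total words, log unique words) data.

import numpy as np
from collections import Counter

# Frequencies from the parse.pl output ("outputfile"), most frequent first.
freqs = np.array(sorted(Counter(open("outputfile").read().lower().split()).values(),
                        reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

# Zipf's Law in log form is log f = log C - s * log r, so the intercept and
# slope of a least-squares line through (log r, log f) give C and s.
slope, intercept = np.polyfit(np.log10(ranks), np.log10(freqs), 1)
print("Zipf fit: log10 f = %.3f %+.3f * log10 r" % (intercept, slope))
print("predicted frequency at rank r: f(r) = %.3f * r**(%.3f)" % (10**intercept, slope))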

 

Problem 2 (10 points)
Download the Porter Stemmer (http://www.tartarus.org/~martin/PorterStemmer/) and play with it! As we discussed, the stemmer makes two types of mistakes:

(a) Give 5 examples of pairs of words that should have the same root but that the Porter stemmer stems to different roots, e.g., “matrix” and “matrices.”
(b) Give 5 examples of pairs of words that should have different roots but that the Porter stemmer stems to the same root, e.g.,  “universe” and “university.”

You may not use examples from our class lecture or slides, of course. Print the output of the stemmer on your examples.
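If you prefer not to compile one of the tartarus.org versions, NLTK's PorterStemmer (an equivalent stand-in, not the assignment's copy) makes it easy to test candidate word pairs:

from nltk.stem import PorterStemmer

# Print the Porter stem of each candidate word; the pairs below are the
# ones already used as examples in the problem statement.
stemmer = PorterStemmer()
for word in ["matrix", "matrices", "universe", "university"]:
    print(word, "->", stemmer.stem(word))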

 

Problem 3 (10 points)
Consider a message alphabet of 27 characters consisting of the 26 English letters and the space character.


A. Assume that all 27 characters have equal frequencies of occurrence (1/27 each). What is the entropy of a message?
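Recall that the entropy of a source with symbol probabilities p_i, measured in bits per symbol, is

    H = -\sum_i p_i \log_2 p_i ,

which for a uniform distribution over n symbols reduces to \log_2 n.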

B. Assume now these actual frequencies of occurrence (the numbers are per 1000 letters):

character   frequency        character   frequency
space         185.9          N             57.4
A              64.2          O             63.2
B              12.7          P             15.2
C              21.8          Q              0.8
D              31.7          R             48.4
E             103.1          S             51.4
F              20.8          T             76.9
G              15.2          U             22.8
H              46.7          V              8.3
I              57.5          W             17.5
J               0.8          X              1.3
K               4.9          Y             16.4
L              32.1          Z              0.5
M              19.8

What is the entropy of a message?

C. Can you draw any conclusions from the previous two results?

D. What is the average length of a word?  (Hint: Consider the probability of seeing a space.)
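For part B above, the computation is just the entropy formula applied to the normalized counts; a Python sketch (counts copied from the table above):

import math

# Per-1000 counts from the table; normalize them to probabilities, then
# apply the entropy formula.
counts = {
    "space": 185.9, "A": 64.2, "B": 12.7, "C": 21.8, "D": 31.7, "E": 103.1,
    "F": 20.8, "G": 15.2, "H": 46.7, "I": 57.5, "J": 0.8, "K": 4.9,
    "L": 32.1, "M": 19.8, "N": 57.4, "O": 63.2, "P": 15.2, "Q": 0.8,
    "R": 48.4, "S": 51.4, "T": 76.9, "U": 22.8, "V": 8.3, "W": 17.5,
    "X": 1.3, "Y": 16.4, "Z": 0.5,
}
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print("entropy: %.3f bits per character" % entropy)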

 

Problem 4 (10 points)
According to Heaps' Law, what proportion of a corpus (a collection of texts) must be read before 90% of its vocabulary has been encountered (assume beta = 0.5)? Hint: To solve this problem you don't need to know the value of K.
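Recall that Heaps' Law relates the vocabulary size V (number of unique words) to the number of words read, n, as

    V(n) = K n^{\beta} .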



Problem 5 (10 points)

Term-Document weights courtesy Huahai Yang  http://www.albany.edu/~hy973732/

Given a query Q and a collection of documents A, B, and C shown in Table 1, determine the similarity of A, B, and C to Q using language modeling. That is, determine the probability of generating the query from the language model associated with each document, using the simple multinomial model and the following smoothing techniques (a sketch of these computations follows Table 1):

(a) No smoothing, i.e., maximum likelihood language model.

(b) Laplace smoothing.

(c) Witten-Bell smoothing.  For the corpus model, use the average of the maximum likelihood models created in part (a), e.g., the corpus probability of “cat” is the average of the maximum likelihood probabilities associated with “cat” in documents A, B, and C.

 

The numbers in the table are the weights between each term and document.

Table 1 Term-Document Weights

 

        Cat   Food   Fancy
Q        3     4      1
A        2     1      0
B        1     3      1
C        0     2      2

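A Python sketch of the three computations for the documents in Table 1 follows. The Witten-Bell form used here, lambda_d = |d| / (|d| + T_d) with T_d the number of distinct terms in d, is one common formulation; check it against the one given in class.

# Counts are taken from Table 1.
query = {"cat": 3, "food": 4, "fancy": 1}
docs = {
    "A": {"cat": 2, "food": 1, "fancy": 0},
    "B": {"cat": 1, "food": 3, "fancy": 1},
    "C": {"cat": 0, "food": 2, "fancy": 2},
}
vocab = ["cat", "food", "fancy"]

def ml(d):
    """(a) Maximum-likelihood model: P(w|d) = c(w,d) / |d|."""
    total = sum(d.values())
    return {w: d[w] / total for w in vocab}

def laplace(d):
    """(b) Add-one (Laplace) smoothing: P(w|d) = (c(w,d) + 1) / (|d| + |V|)."""
    total = sum(d.values())
    return {w: (d[w] + 1) / (total + len(vocab)) for w in vocab}

# Corpus model for (c): the average of the three maximum-likelihood models.
corpus = {w: sum(ml(d)[w] for d in docs.values()) / len(docs) for w in vocab}

def witten_bell(d):
    """(c) Witten-Bell smoothing: mix the ML model with the corpus model using
    lambda = |d| / (|d| + T_d), where T_d is the number of distinct terms in d
    (an assumed formulation)."""
    total = sum(d.values())
    distinct = sum(1 for w in vocab if d[w] > 0)
    lam = total / (total + distinct)
    p_ml = ml(d)
    return {w: lam * p_ml[w] + (1 - lam) * corpus[w] for w in vocab}

def query_likelihood(model):
    """Multinomial query likelihood: product over query terms of P(w|d)^c(w,Q)."""
    p = 1.0
    for w, c in query.items():
        p *= model[w] ** c
    return p

for name, d in docs.items():
    print(name,
          "ML=%.4g" % query_likelihood(ml(d)),
          "Laplace=%.4g" % query_likelihood(laplace(d)),
          "Witten-Bell=%.4g" % query_likelihood(witten_bell(d)))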
 

Problem 6 (10 points)

Courtesy Huahai Yang  http://www.albany.edu/~hy973732/

Convert the raw term frequencies in Table 2 to Okapi tf * idf weights. Use the formula Okapi tf = tf / (tf + K) with K = 1.5. (A sketch of the computation follows Table 2.)

Table 2 Raw Term Frequency for a Collection of Documents

 

        Cape   Whale   Nightmare   Arm   Door
A        0      3       0           0     0
B        1      1       0           1     5
C        3      3       1           1     9
D        0      0       1           9     0
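A Python sketch of the conversion follows. The Okapi tf formula is the one given in the problem; the idf form log(N / n_t), with N the number of documents and n_t the number of documents containing term t, is an assumption, so substitute whichever idf definition was given in class.

import math

# Raw term frequencies copied from Table 2; columns follow the table order.
terms = ["cape", "whale", "nightmare", "arm", "door"]
raw = {
    "A": [0, 3, 0, 0, 0],
    "B": [1, 1, 0, 1, 5],
    "C": [3, 3, 1, 1, 9],
    "D": [0, 0, 1, 9, 0],
}
K = 1.5

N = len(raw)                                           # number of documents
df = [sum(1 for doc in raw.values() if doc[i] > 0)     # document frequency of term i
      for i in range(len(terms))]
idf = [math.log(N / d) for d in df]                    # assumed idf = log(N / n_t)

for name, counts in raw.items():
    weights = [(tf / (tf + K)) * idf[i] for i, tf in enumerate(counts)]
    print(name, [round(w, 3) for w in weights])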