ISU535 06X1 Information Retrieval

Homework 2

Created: Mon 2 May 2005
Last modified: 

Make sure you check the syllabus for the due date.

Submit on paper, at the beginning of the class on the day the homework is due. We encourage you to type it down (txt or pdf) rather than writing by hand.

 

Problem 1 (60 points)
Courtesy James Allan
http://www-ciir.cs.umass.edu/~allan/ 

A. In this homework problem, you will write a quick program to explore Zipf’s Law.
Go to Project Gutenberg’s (
http://www.gutenberg.org) and select Alice in Wonderland by Lewis Carol at http://www.gutenberg.org/dirs/etext91/alice30.txt (ignore the header, consider only the text starting at "ALICE'S ADVENTURES IN WONDERLAND"). Use this perl script to strip the text of punctuation obtaining the original text as a list of words. For example, on a unix base systems, you should run a command like
parse.pl alice30.txt > outputfile
Write a quick program or script that counts words frequency of occurrence. For the most frequent 25 words and for the most frequent 25 additional words that start with the letter c (a total of 50 words), print the word, the number of times it occurs, its rank in the overall list of words, the probability of occurrence, and the product of the rank and the probability. Also indicate the total number of words and the total number of unique words that you found. Discuss whether this text satisfies Zipf’s Law. Feel free to use other parts of the ranked list of terms.

B. Suppose that while you were building an inverted file, you omitted all words that occur fewer than five times (i.e., one to four times). According to Zipf’s Law, what proportion of the total words in the collection would you not be indexing (justify your answer)? What proportion would actually be omitted in the Alice in Wonderland text above?

 

Problem 2 (10 points)
Download the Porter Stemmer (http://www.tartarus.org/~martin/PorterStemmer/) and play with it! As you already know, the stemmer makes two types of mistakes:
(a) Give 5 examples of pairs of words that should have the same root but that the Porter stemmer stems to different roots, e.g., “matrix” and “matrices.”
(b) Give 5 examples of pairs of words that should have different roots but that the Porter stemmer stems to the same root, e.g.,  “universe” and “university.”
DO NOT use examples from class lecture. Print the output of the stemmer on your examples.

 

Problem 3 (25 points)
Prof Jay Aslam has a list of songs on his iPod, each song played over the last year a number of times, as indicated in this ranked list
The left plot below plots (blue dots) for each song the number of times played (Y axis) , ranked on the X axis descending. The red line is a fitted Zipfian distribution.
The right plot is the same as the left one with both axes on logarithmic scale; as shown in class, the Zipfian function becomes a straight line





figure 1a): the number of times each song is played (Y axes) ranked descending (X axes); red line is the fitted Zipfian distribution.

figure 1b) same plot as figure 1a) with axes on logarithmic scale; the Zipfian function becomes a line.

Your task is to find the best fit (straight line) for the logarithmic scale representation of data. Consult http://mathworld.wolfram.com/LeastSquaresFitting.html for a complete discussion on least squares fitting. If data points are represented by




and you are looking for the linear function (straight line) given by



then












For the given points, compute the fitting coefficients m and b.





Problem 4 (15 points)
Consider the alphabet of the 27 letters (including
space) of the English language and “messages” that consist of single letters.


A.
Assume first that all 27 letters have equal frequencies of occurrence (1/27 each). What is the entropy of a message?

B. Assume now these actual frequencies of occurrence (the numbers are per 1000 letters):

space

A

B

C

D

E

F

G

H

I

J

K

L

M

185.9

64.2

12.7

21.8

31.7

103.1

20.8

15.2

46.7

57.5

0.8

4.9

32.1

19.8

 

N

O

P

Q

R

S

T

U

V

W

X

Y

Z

57.4

63.2

15.2

0.8

48.4

51.4

76.9

22.8

8.3

17.5

1.3

16.4

0.5

What is the entropy of a message?

C.Can you draw any conclusions from the previous two results ?

D.What is the average length of a word?  (Hint: Consider the probability of seeing a space.)

 

Problem 5 (25 points)
A
. According to Heaps' Law, what proportion of a corpus (a collection of texts) must be read before 90% of its vocabulary has been encountered (assume
beta= 0.5; hint: to solve this problem you don't need to know the value of K).

B. Verify Heaps Law on the Alice in Wonderland text.  Process each word of the text, in order, and compute the pairs of numbers:  (number of words processed, number of unique words seen).  These pairs of numbers, treated as (x,y) values, should satisfy Heaps Law.  Appropriately transform the data and use Least Squares to determine the model parameters K and beta.

(Hint, example : Go through each word of the text, one-by-one.  Keep track of how many words you have gone through ("words processed") and how many unique words you have seen up to that point.  These are the pairs of numbers.  For example, given the text: the cat in the hat you will have seen one unique word ("the") after processing the first word in the text ("the"), two unique words ("the" and "cat") after processing the first two words, etc.  The pairs of numbers generated will thus be:
(1,1)
(2,2)
(3,3)
(4,3)
(5,4)
Note the fourth pair (4,3) --- after processing the fourth word, you will have only seen three unique words since the second "the" had appeared earlier... )





Problem 6 (20 points)

Term-Document weights courtesy Huahai Yang  http://www.albany.edu/~hy973732/

Given a query Q and a collection of documents A, B and C shown in table 1, determine the similarity of A, B, C to Q using language modeling.  Determine the probability of generating the query from the language model associated with the documents using the simple multinomial model and the following smoothing techniques:

(a) No smoothing, i.e., maximum likelihood language model.

(b) Laplace smoothing.

(c) Witten-Bell smoothing.  For the corpus model, use the average of the maximum likelihood models created in part (a); e.g., the corpus probability of “cat” is the average of the maximum likelihood probabilities associated with “cat” in documents A, B, and C.

 

The numbers in the table are the weights between each term and document.

Table 2 Term-Document Weights

 

Cat

Food

Fancy

Q

3

4

1

A

2

1

0

B

1

3

1

C

0

2

2



 

Problem 7 (15 points)

Courtesy Huahai Yang  http://www.albany.edu/~hy973732/

Convert the raw frequency weights in table 2 to tf*idf weights.

Table 3 Raw Term Frequency for a Collection of Documents

 

Cape

Whale

Nightmare

Arm

Door

A

0

3

0

0

0

B

1

1

0

1

5

C

3

3

1

1

9

D

0

0

1

9

0