ISU535 06X1 Information Retrieval
Homework 2
Created: Mon 2 May 2005
Last modified:
Make sure you check the syllabus for the due date.
Submit on paper, at the beginning of the class on the day the homework is due. We encourage you to type it down (txt or pdf) rather than writing by hand.
Problem
1 (60 points)
Courtesy James Allan
http://www-ciir.cs.umass.edu/~allan/
A. In
this homework problem, you will write a quick program to explore
Zipf’s Law.
Go to Project Gutenberg’s
(http://www.gutenberg.org)
and select Alice in Wonderland by Lewis Carol at
http://www.gutenberg.org/dirs/etext91/alice30.txt
(ignore the header, consider only
the text starting at "ALICE'S ADVENTURES IN WONDERLAND").
Use this perl script to strip the text of
punctuation obtaining the original text as a list of words. For
example, on a unix base systems, you should run a command like
parse.pl alice30.txt > outputfile
Write a quick program or script that
counts words frequency of occurrence. For the most frequent 25 words
and for the most frequent 25 additional words that start with
the letter c (a total of 50 words), print the word, the number
of times it occurs, its rank in the overall list of words, the
probability of occurrence, and the product of the rank and the
probability. Also indicate the total number of words and the total
number of unique words that you found. Discuss whether this text
satisfies Zipf’s Law. Feel free to use other parts of the ranked
list of terms.
B. Suppose that while you were building an inverted file, you omitted all words that occur fewer than five times (i.e., one to four times). According to Zipf’s Law, what proportion of the total words in the collection would you not be indexing (justify your answer)? What proportion would actually be omitted in the Alice in Wonderland text above?
Problem
2 (10 points)
Download the Porter Stemmer
(http://www.tartarus.org/~martin/PorterStemmer/)
and play with it! As you already know, the stemmer makes two types of
mistakes:
(a) Give 5 examples of pairs of words that should have
the same root but that the Porter stemmer stems to different roots,
e.g., “matrix” and “matrices.”
(b) Give 5 examples of
pairs of words that should have different roots but that the Porter
stemmer stems to the same root, e.g., “universe” and
“university.”
DO NOT use examples from class lecture. Print
the output of the stemmer on your examples.
Problem
3 (25 points)
Prof Jay Aslam has a list of songs on
his iPod, each song played over the last year a number of times, as
indicated in this ranked list
The
left plot below plots (blue dots) for each song the number of times
played (Y axis) , ranked on the X axis descending. The red line is a
fitted Zipfian distribution.
The right plot is the same as the
left one with both axes on logarithmic scale; as shown in class, the
Zipfian function becomes a straight line
|
|
|
|
figure 1a): the number of times each song is played (Y axes) ranked descending (X axes); red line is the fitted Zipfian distribution. |
figure 1b) same plot as figure 1a) with axes on logarithmic scale; the Zipfian function becomes a line. |
Your task is to find the best fit (straight line) for the logarithmic scale representation of data. Consult http://mathworld.wolfram.com/LeastSquaresFitting.html for a complete discussion on least squares fitting. If data points are represented by
![]()
and
you are looking for the linear function (straight line) given by
|
|
then
|
|
|
|
|||||
|
|
|
|
|
|
|
|
|
For the given points, compute the fitting coefficients m and b.
Problem
4 (15 points)
Consider the alphabet of the 27
letters (including space)
of the English language and “messages” that consist of single
letters.
A.
Assume first that all 27 letters have equal
frequencies of occurrence (1/27 each). What is the entropy of a
message?
B. Assume now these actual frequencies of occurrence (the numbers are per 1000 letters):
|
space |
A |
B |
C |
D |
E |
F |
G |
H |
I |
J |
K |
L |
M |
|
185.9 |
64.2 |
12.7 |
21.8 |
31.7 |
103.1 |
20.8 |
15.2 |
46.7 |
57.5 |
0.8 |
4.9 |
32.1 |
19.8 |
|
N |
O |
P |
Q |
R |
S |
T |
U |
V |
W |
X |
Y |
Z |
|
57.4 |
63.2 |
15.2 |
0.8 |
48.4 |
51.4 |
76.9 |
22.8 |
8.3 |
17.5 |
1.3 |
16.4 |
0.5 |
What is the entropy of a message?
C.Can you draw any conclusions from the previous two results ?
D.What is the average length of a word? (Hint: Consider the probability of seeing a space.)
Problem
5 (25 points)
A. According to Heaps' Law, what
proportion of a corpus (a collection of texts) must be read before
90% of its vocabulary has been encountered (assume beta=
0.5; hint: to solve this problem you
don't need to know the value of K).
B. Verify Heaps Law on the Alice in Wonderland text. Process each word of the text, in order, and compute the pairs of numbers: (number of words processed, number of unique words seen). These pairs of numbers, treated as (x,y) values, should satisfy Heaps Law. Appropriately transform the data and use Least Squares to determine the model parameters K and beta.
(Hint,
example : Go through each word of the text, one-by-one. Keep
track of how many words you have gone through ("words
processed") and how many unique words you have seen up to that
point. These are the pairs of numbers. For example, given
the text: the cat in the
hat you will have seen one unique word ("the")
after processing the first word in the text ("the"), two
unique words ("the" and "cat") after processing
the first two words, etc. The pairs of numbers generated will
thus be:
(1,1)
(2,2)
(3,3)
(4,3)
(5,4)
Note
the fourth pair (4,3) --- after processing the fourth word, you will
have only seen three unique words since the second "the"
had appeared earlier... )
Problem 6 (20 points)
Term-Document weights courtesy Huahai Yang http://www.albany.edu/~hy973732/
Given a query Q and a collection of documents A, B and C shown in table 1, determine the similarity of A, B, C to Q using language modeling. Determine the probability of generating the query from the language model associated with the documents using the simple multinomial model and the following smoothing techniques:
(a) No smoothing, i.e., maximum likelihood language model.
(b) Laplace smoothing.
(c) Witten-Bell smoothing. For the corpus model, use the average of the maximum likelihood models created in part (a); e.g., the corpus probability of “cat” is the average of the maximum likelihood probabilities associated with “cat” in documents A, B, and C.
The numbers in the table are the weights between each term and document.
Table 2 Term-Document Weights
|
|
Cat |
Food |
Fancy |
|
Q |
3 |
4 |
1 |
|
A |
2 |
1 |
0 |
|
B |
1 |
3 |
1 |
|
C |
0 |
2 |
2 |
Problem 7 (15 points)
Courtesy Huahai Yang http://www.albany.edu/~hy973732/
Convert the raw frequency weights in table 2 to tf*idf weights.
Table 3 Raw Term Frequency for a Collection of Documents
|
|
Cape |
Whale |
Nightmare |
Arm |
Door |
|
A |
0 |
3 |
0 |
0 |
0 |
|
B |
1 |
1 |
0 |
1 |
5 |
|
C |
3 |
3 |
1 |
1 |
9 |
|
D |
0 |
0 |
1 |
9 |
0 |