CS6200: Information Retrieval
Assigned: Thursday, 19 September 2013
Due: Thursday, 3 October 2013, 6pm
- If you collaborated with others, you must write down with whom
you worked on the assignment. If this changes from problem to
problem, then you should write down this information separately
with each problem.
- Submit the requested written answers, code, and instructions to
the TAs on how to (compile and) run the code.
- [10 points] In the lectures on text acquisition, we discussed
duplicate detection techniques such as fingerprinting and locality
sensitive hashing (simhash). Plagiarism detection is similar but
often involves finding passages in documents that are
duplicates, or near duplicates, of passages in other
documents. (For instance, your roommate may be copying your answer
to this question but may think his own answer to question #3 is
much better.) Assuming you have a large database of original
documents available, describe in detail a design for a system that
detects plagiarized passages in a target document.
- [15 points] [Courtesy James Allan]
- In this homework problem, you will write a quick program to
explore Zipf's Law.
Go to the Project Gutenberg website and download Alice's Adventures
in Wonderland by Lewis Carroll (Plain Text UTF-8
format). Strip off the header, and thus consider only the
text starting at "ALICE'S ADVENTURES IN WONDERLAND", just
preceding "CHAPTER 1"; also, strip off the footer, eliminating
the license agreement and other extraneous text, and thus
consider only the text up through, and including, "THE END".
Use the provided Perl script (parse.pl) to strip the text of
punctuation, obtaining the original text as a list of
words. For example, on a Unix-based system, you would run:
    perl parse.pl alice30.txt > output
Write a quick program or script that counts word frequencies.
For the most frequent 25 words and for the most frequent
25 additional words that start with the letter f
(a total of 50 words), print the word, the number of times it
occurs, its rank in the overall list of words, the probability
of occurrence, and the product of the rank and the
probability. Also indicate the total number of words and the
total number of unique words that you found. Discuss whether
this text satisfies Zipf's Law. Feel free to use other parts
of the ranked list of terms.
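A minimal word-frequency counter for this part might look like the
following Python sketch (the file name "output" and the sample
sentence are assumptions; feed it the words produced by the parse
script):

```python
from collections import Counter

def zipf_rows(words):
    """One row per distinct word, most frequent first:
    (word, count, rank, probability, rank * probability)."""
    counts = Counter(words)
    total = sum(counts.values())
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        prob = count / total
        rows.append((word, count, rank, prob, rank * prob))
    return rows

# On the stripped Alice text the input would be, e.g.:
#   words = open("output").read().split()
words = "the cat sat on the mat and the fat cat ran".split()
for row in zipf_rows(words)[:3]:
    print("%-5s count=%d rank=%d p=%.3f rank*p=%.3f" % row)
print("total words:", len(words), "unique words:", len(set(words)))
```

If Zipf's Law holds, the rank * probability column should stay roughly
constant down the ranked list.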
- Suppose that while building a retrieval index,
you decided to omit all words that occur fewer than five times (i.e.,
one to four times). According to Zipf's Law, what proportion of the
total words in the collection would you omit?
(Justify your answer.) What proportion would actually be
omitted in the Alice in Wonderland text above?
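For the empirical half of this question, the actually-omitted
proportion can be measured with a short sketch like the following
(the file name "output" is an assumption):

```python
from collections import Counter

def omitted_fraction(words, min_count=5):
    """Fraction of all word occurrences that would be dropped if every
    word appearing fewer than min_count times were omitted."""
    counts = Counter(words)
    total = sum(counts.values())
    omitted = sum(c for c in counts.values() if c < min_count)
    return omitted / total

# e.g. on the stripped Alice text:
#   print(omitted_fraction(open("output").read().split()))
```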
- [15 points]
- According to Heaps' Law, what proportion of a collection of
text must be read before 90% of its vocabulary has been
encountered? You may assume that
beta = 0.5. Hint: to solve this problem, you do not need to know the
value of K.
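One way to read the hint (a sketch, writing V(n) for the vocabulary
seen after n words, with N the total collection size and x the
unknown proportion): the constant K cancels in the ratio of
vocabulary sizes.

```latex
V(n) = K n^{\beta}, \qquad
\frac{V(xN)}{V(N)} = \frac{K (xN)^{\beta}}{K N^{\beta}}
                   = x^{\beta} = 0.9
\quad\Longrightarrow\quad x = 0.9^{1/\beta} .
```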
- [This part is 15 additional points of extra
credit.] Verify Heaps' Law on the Alice in
Wonderland text. Process each word of the text, in order,
and compute the following pairs of numbers: (number of words
processed, number of unique words seen). These pairs of numbers,
treated as (x, y) values, should satisfy Heaps' Law.
Appropriately transform the data and use least squares to
determine the model parameters K and beta. As explained in
class, there are better ways of estimating a power-law
distribution, but this quick-and-dirty hack is often sufficient.
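The log-transform-and-fit procedure described above can be sketched as
follows (plain least squares on log-log data; only the standard
library is used):

```python
import math

def heaps_fit(words):
    """Least-squares fit of Heaps' Law V = K * n**beta using the
    log-transformed pairs: log V = log K + beta * log n."""
    seen = set()
    xs, ys = [], []
    for n, word in enumerate(words, start=1):
        seen.add(word)
        xs.append(math.log(n))          # log(words processed)
        ys.append(math.log(len(seen)))  # log(unique words seen)
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    beta = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    log_k = (sy - beta * sx) / m
    return math.exp(log_k), beta

# e.g. on the stripped Alice text:
#   K, beta = heaps_fit(open("output").read().split())
```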
- [10 points] Use the result numbers associated with a web search
engine query to do the following:
- Think up two 3-word queries and submit them to a search
engine. Tell us which search engine you used. Estimate the
number of results for these two queries using numbers from
single words and pairs of words contained in the query. Compare
them to the numbers returned by the search engine. Discuss any
discrepancies.
- Estimate the size of the search engine's indexed
corpus. Compare the size estimates from the two queries and
discuss the consistency (or lack of it) of these estimates.
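The two estimates above can be sketched as follows, under the usual
assumption that query words occur independently in documents (all
counts below are hypothetical placeholders, not numbers from a real
engine):

```python
def estimate_query_results(fa, fb, fc, n):
    """Expected result count for a 3-word query if the words occurred
    independently: n * (fa/n) * (fb/n) * (fc/n) = fa*fb*fc / n**2."""
    return fa * fb * fc / n ** 2

def estimate_corpus_size(fa, fb, fab):
    """Corpus size implied by a word pair under independence:
    fab/n = (fa/n) * (fb/n)  =>  n = fa * fb / fab."""
    return fa * fb / fab

# Hypothetical single-word and pair counts, for illustration only:
print(estimate_corpus_size(2_000_000, 1_500_000, 30_000))  # 100000000.0
```

Repeating the corpus-size estimate with the different pairs from your
two queries is what exposes the (in)consistency the question asks
about.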