# IS4200/CS6200: Information Retrieval

## Homework 2

Assigned: Thursday, 18 October 2012
Due: Friday, 26 October 2012, 6pm

### Instructions

1. If you collaborated with others, you must write down with whom you worked on the assignment. If this changes from problem to problem, then you should write down this information separately with each problem.
2. Submit the requested written answers, code, and instructions to the TAs on how to (compile and) run the code.

### Problems

1. (15 points) [Courtesy James Allan]
1. In this homework problem, you will write a quick program to explore Zipf's Law.

Go to the Project Gutenberg website and download Alice in Wonderland by Lewis Carol (Plain Text UTF-8 format). Strip off the header, and thus consider only the text starting at "ALICE'S ADVENTURES IN WONDERLAND", just preceding "CHAPTER 1"; also, strip off the footer, eliminating the license agreement and other extraneous text, and thus consider only the text up through, and including, "THE END". Use this perl script to strip the text of punctuation obtaining the original text as a list of words. For example, on a unix based systems, you should run a command like
```parse.pl alice30.txt > output
```

Write a quick program or script that counts word frequencies. For the most frequent 25 words and for the most frequent 25 additional words that start with the letter f (a total of 50 words), print the word, the number of times it occurs, its rank in the overall list of words, the probability of occurrence, and the product of the rank and the probability. Also indicate the total number of words and the total number of unique words that you found. Discuss whether this text satisfies Zipf's Law. Feel free to use other parts of the ranked list of terms.

2. Suppose that while you were building retrieval index, you decided to omit all words that occur fewer than five times (i.e., one to four times). According to Zipf's Law, what proportion of the total words in the collection would you omit? (Justify your answer.) What proportion would actually be omitted in the Alice in Wonderland text above?
2. (15 points)
1. According to Heaps' Law, what proportion of a collection of text must be read before 90% of its vocabulary has been encountered? You may assume that `beta=0.5`. Hint: to solve this problem you don't need to know the value of `K`
2. [Update: This part is now 15 additional points of extra credit.] Verify Heap's Law on the Alice in Wonderland text. Process each word of the text, in order, and compute the following pairs of numbers: (number of words processed, number of unique words seen). These pairs of numbers, treated as (x,y) values, should satisfy Heaps Law. Appropriately transform the data and use least squares to determine the model parameters K and beta, in much the same manner as Zipf's Law example from class.
3. (10 points) Use the result numbers associated with a web search engine query to do the following:
1. Think up two 3-word queries and submit them to a search engine. Tell us which search engine you used. Estimate the number of results for these two queries using numbers from single words and pairs of words contained in the query. Compare them to the numbers returned by the search engine. Discuss results.
2. Estimate the size of the search engine's indexed corpus. Compare the size estimates from the two queries and discuss the consistency (or lack of it) of these estimates.
4. (10 points) Describe in detail two optimizations that could be used to speed up conjunctive document-at-a-time query processing. Explain why these optimizations are either “safe” or “unsafe” (also known as “lossless” or “lossy”). Under what circumstances is one of your optimizations better? Why?