CS6200 Homework Two

Assigned:
Wednesday, January 22, 2014
Due:
10:00pm, Saturday, February 1, 2014

Important: Submit by committing your files to svn. See the checklist below to make sure you turn everything in.

Problem One

We have discussed in lecture how to find two documents which are exact duplicates using a simple checksum of the entire contents, and also how to find documents which are near duplicates by using a similarity hash on the documents. Now consider the problem of finding a short passage within a document which is a near duplicate of some passage in another document, even if the two documents are otherwise not similar at all.

Imagine that you get a job as a TA for a Computer Science course. When the course instructor hears of your excellent grades in Information Retrieval, she asks you to help her solve her plagiarism problems. Fortunately, she has massive amounts of data to aid you, in the form of hundreds of terabytes of previous student assignments and web pages from which they might copy their answers. Unfortunately, her lab has limited computational power, so algorithmic efficiency is key. Your task is to design a system which preprocesses this data somehow and then, given a batch of submitted student homework assignments, finds submissions containing passages that are likely to have been copied either from one of the stored documents or from another of the submitted assignments. Your system must be efficient enough to run on the computers in your professor's lab and to finish in time for grades to be handed out.

  • Create a block diagram showing the major components of your system and their relationships.
  • Describe your algorithm using a flow chart, pseudocode, or similar approach. Provide enough detail that a first-year grad student would understand it, even if you weren't there to answer questions.
  • Write an explanation of your approach in no more than one page. You will be graded in part on the neatness of your submission and the level of professionalism in your writing. Your explanation should address (at least) the following issues:
    • What is the algorithmic complexity of your approach, in terms of the total number of documents you're searching? You don't need to rigorously prove the complexity, but you should justify your answer.
    • Is your program guaranteed to find matches, or is there some chance it will miss something? If it might miss something, please provide a general idea of its accuracy.
    • Can your system identify passages of various lengths? What is the shortest or longest passage that it can identify?
    • Do you need to worry about spam? If so, what can you do about it? If not, why not?
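As a starting point only (your design need not use it), here is a minimal sketch of one ingredient commonly used for near-duplicate passage detection: word-level shingling with modular fingerprint selection. The shingle width w, the modulus p, and the function name are illustrative assumptions, not part of the assignment.

import hashlib

def shingle_fingerprints(words, w=8, p=16):
    # Hash every w-word shingle and keep only fingerprints that are 0 mod p,
    # so two documents that share a copied passage tend to share some of the
    # fingerprints that were kept.
    kept = set()
    for i in range(len(words) - w + 1):
        shingle = " ".join(words[i:i + w]).lower()
        h = int(hashlib.md5(shingle.encode("utf-8")).hexdigest(), 16)
        if h % p == 0:
            kept.add(h)
    return kept

# Usage sketch: store the kept fingerprints of every archived document in an
# inverted index, then flag any submission whose fingerprints collide with it.

Your writeup would still need to address how such fingerprints are stored, how matches are localized to passages, and the complexity, accuracy, and spam questions above.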

Problem Two

[Courtesy James Allan]

In this homework problem, you will write a quick program to explore Zipf's Law.

Go to the Project Gutenberg website and download Alice in Wonderland by Lewis Carroll (Plain Text UTF-8 format). Strip off the header, and thus consider only the text starting at "ALICE'S ADVENTURES IN WONDERLAND", just preceding "CHAPTER 1"; also, strip off the footer, eliminating the license agreement and other extraneous text, and thus consider only the text up through, and including, "THE END". Use this Perl script to strip the text of punctuation, obtaining the original text as a list of words. For example, on a Unix-based system, you should run a command like:

parse.pl alice30.txt > output
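If it helps, a minimal sketch of the header/footer stripping step in Python is below; the input name alice30.txt matches the command above, while the output name alice_body.txt and the exact marker strings are assumptions you may need to adjust to the downloaded file (parse.pl would then be run on the stripped file).

# Keep only the text from the "ALICE'S ADVENTURES IN WONDERLAND" occurrence
# just before "CHAPTER 1" through the final "THE END", inclusive.
with open("alice30.txt", encoding="utf-8") as f:
    text = f.read()

chapter_pos = text.find("CHAPTER 1")
start = text.rfind("ALICE'S ADVENTURES IN WONDERLAND", 0, chapter_pos)
end = text.rfind("THE END") + len("THE END")

with open("alice_body.txt", "w", encoding="utf-8") as out:
    out.write(text[start:end])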
  1. Write a quick program or script that counts word frequencies (a minimal counting sketch appears after this list). For the most frequent 25 words and for the most frequent 25 additional words that start with the letter f (a total of 50 words), print the word, the number of times it occurs, its rank in the overall list of words, the probability of occurrence, and the product of the rank and the probability. Also indicate the total number of words and the total number of unique words that you found. Discuss whether this text satisfies Zipf's Law. Feel free to use other parts of the ranked list of terms.
  2. Suppose that while you were building a retrieval index, you decided to omit all words that occur fewer than five times (i.e., one to four times). According to Zipf's Law, what proportion of the total words in the collection would you omit? (Justify your answer.) What proportion would actually be omitted in the Alice in Wonderland text above?
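A minimal sketch of the counting step for part 1, assuming the punctuation-stripped words are in a file named output (as produced by the command above) and that case is folded; the file name and output formatting are assumptions.

from collections import Counter

with open("output", encoding="utf-8") as f:
    words = f.read().split()

counts = Counter(w.lower() for w in words)
total = sum(counts.values())
ranked = counts.most_common()            # (word, frequency), most frequent first

top25 = ranked[:25]
top25_words = {w for w, _ in top25}
# The 25 most frequent *additional* words starting with "f".
top25_f = [(w, c) for w, c in ranked
           if w.startswith("f") and w not in top25_words][:25]

for word, freq in top25 + top25_f:
    rank = ranked.index((word, freq)) + 1    # 1-based rank in the full list
    prob = freq / total
    print(word, freq, rank, round(prob, 6), round(rank * prob, 4))

print("total words:", total)
print("unique words:", len(counts))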

Problem Three

  1. According to Heaps' Law, what proportion of a collection of text must be read before 90% of its vocabulary has been encountered? You may assume that beta=0.5. Hint: to solve this problem you don't need to know the value of K.
  2. [Extra credit] Verify Heaps' Law on the Alice in Wonderland text. Process each word of the text, in order, and compute the following pairs of numbers: (number of words processed, number of unique words seen). These pairs of numbers, treated as (x,y) values, should satisfy Heaps' Law. Appropriately transform the data and use least squares to determine the model parameters K and beta (a minimal fitting sketch follows this list). As explained in class, there are better ways of estimating a power-law distribution, but this quick-and-dirty hack is often effective.
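For the extra credit, one quick way to do the fit is sketched below, assuming words is the in-order token list from Problem 2; the use of numpy.polyfit on log-transformed data is an illustrative choice. Heaps' Law says V is approximately K * n^beta, so log V is approximately log K + beta * log n, which is linear in log n.

import numpy as np

seen = set()
ns, vs = [], []
for i, w in enumerate(words, start=1):   # words: in-order token list from Problem 2
    seen.add(w)
    ns.append(i)           # number of words processed so far
    vs.append(len(seen))   # number of unique words seen so far

# Fit log V = beta * log n + log K by least squares.
beta, logK = np.polyfit(np.log(ns), np.log(vs), 1)
print("K =", np.exp(logK), "beta =", beta)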

Problem Four

Use the result counts reported by a web search engine to do the following:

  1. Think up two 3-word queries and submit them to a search engine. Tell us which search engine you used. Estimate the number of results for each of these two queries using the result counts for the single words and the word pairs contained in the query (a minimal sketch of one way to set up these estimates appears after this list). Compare your estimates to the numbers returned by the search engine, and discuss your results.
  2. Estimate the size of the search engine's indexed corpus. Compare the size estimates from the two queries and discuss the consistency (or lack of it) of these estimates.
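One way to set up both estimates is sketched below using the word-independence assumption from lecture; the counts are hypothetical placeholders standing in for the numbers the search engine reports, not real data.

# Hypothetical placeholder counts; replace with the counts the engine reports.
f_a, f_b, f_c = 1.0e8, 5.0e7, 2.0e7   # single-word result counts for query "a b c"
f_ab = 1.0e6                          # result count for the pair "a b"

# Part 2: if a and b occur independently, f_ab / N ~ (f_a / N) * (f_b / N),
# so the indexed corpus size can be estimated as N ~ f_a * f_b / f_ab.
N = f_a * f_b / f_ab

# Part 1: under the same assumption, the 3-word query's count is
# f_abc / N ~ (f_a / N) * (f_b / N) * (f_c / N), i.e. f_abc ~ f_a*f_b*f_c / N**2.
f_abc_estimated = f_a * f_b * f_c / N**2

print("estimated corpus size:", N)
print("estimated 3-word result count:", f_abc_estimated)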

Submission Checklist

  • Problem 1: The diagram
  • Problem 1: The flowchart/algorithm
  • Problem 1: Your explanation
  • Problem 2: Your source code
  • Problem 2a: Your word frequency results
  • Problem 2a: Your comments on the results
  • Problem 2b: Your comments
  • Problem 3a: Your comments
  • Problem 3b: (Extra credit) Your results and comments
  • Problem 4a: Your queries, results, and comments
  • Problem 4b: Your queries, estimate, and comments

Rubric

  • Problem 1: 30 Points
    5 points:
    Diagram contains all necessary components, and relationships make sense.
    10 points:
    Algorithm flowchart/pseudocode is clear, and accomplishes the task correctly.
    10 points:
    Writeup explains the approach well and addresses all concerns stated in the assignment.
    5 points:
    Presentation and writing quality. Diagrams and pseudocode are clear, and written explanation takes no more than one page.
  • Problem 2: 30 Points
    15 points:
    Your results on 2a make sense, and support your comments.
    15 points:
    Your calculation and comments are correct.
  • Problem 3: 15 Points (+20)
    15 points:
    Your calculation is correct.
    +20 points:
    Your extra credit analysis is correct.
  • Problem 4: 25 Points
    15 points:
    Your result-count estimates in 4a are correct, and your queries are suited to the task.
    10 points:
    Your corpus size estimates in 4b are correct, and your discussion of their consistency makes sense.