important Submit by committing your files to svn. See the checklist below to make sure you turn everything in.
We have discussed in lecture how to find two documents which are exact duplicates using a simple checksum of the entire contents, and also how to find documents which are near duplicates by using a similarity hash on the documents. Now consider the problem of finding a short passage within a document which is a near duplicate of some passage in another document, even if the remainder of the two documents are not similar at all.
Imagine that you get a job as a TA for a Computer Science course. When the course instructor hears of your excellent grades in Information Retrieval, she asks you to help her solve her plagiarism problems. Fortunately, she has massive amounts of data to aid you, in the form of hundreds of terabytes of previous student assignments and web pages from which they might copy their answers. Unfortunately, her lab has limited computational power, so algorithmic efficiency is key. Your task is to design a system which preprocesses this data somehow and then, given a batch of submitted student homework assignments, finds submissions with passages which are likely to have been copied either from one of the stored documents or from another of the submitted assignments. Your system must work efficiently enough that it works on the computers in your professor's lab, and completes in time to hand out grades.
[Courtesy James Allan]
In this homework problem, you will write a quick program to explore Zipf's Law.
Go to the Project Gutenberg website and download Alice in Wonderland by Lewis Carol (Plain Text UTF-8 format). Strip off the header, and thus consider only the text starting at "ALICE'S ADVENTURES IN WONDERLAND", just preceding "CHAPTER 1"; also, strip off the footer, eliminating the license agreement and other extraneous text, and thus consider only the text up through, and including, "THE END". Use this perl script to strip the text of punctuation obtaining the original text as a list of words. For example, on a unix based systems, you should run a command like:
parse.pl alice30.txt > output
Use the result numbers associated with a web search engine query to do the following: