Important: Submit by committing your files to svn. See the checklist below to make sure you turn everything in.
In this project, you will write an indexer and use it to index a collection of web documents. In the next project, you will create a ranker that uses your index to process queries. If you combine these two projects with the crawler you have already written, you will have constructed a simple search engine.
As you write this project, it's a good idea to think of which components you'll need in order to write a ranker later on and to design your code accordingly. A little extra thought now can save you some trouble later.
For this assignment, you will be using this 48MB document collection (your CCIS password is required).
Important: Please do not commit these documents to Subversion! We already have a copy.
You will write three programs. Store them in a folder named pr1, and include a file at pr1/README.txt which explains how to run your programs and whether you attempted the extra credit. These programs are detailed below. Note that you must compress your output files before storing them in Subversion. You should not compress your source code.
The first step in creating an index is tokenization. You must convert a document into a stream of tokens suitable for indexing; each token should match the regular expression:
\w+(\.?\w+)*
Your tokenizer records its results in several output files. For example, a line mapping a docid to its document name:
1234\tclueweb12-0000tw-13-04988
a line mapping a termid to its term:
567\tasparagus
and a line of doc_index.txt listing a docid, a termid, and each position in that document where the term occurs:
1234\t567\t1\t3\t12\t42
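The tokenization step can be sketched as follows. This is a minimal sketch, not a full implementation: it omits stemming and any other normalization the assignment requires, assumes positions are numbered from 1, and assigns termids in order of first appearance (the numbering scheme is an assumption, not something the spec fixes).

```python
import re

# Token pattern from the assignment: word characters, optionally joined
# by single periods (so "u.s.a" or "3.14" stay one token).
TOKEN_RE = re.compile(r"\w+(?:\.?\w+)*")

def tokenize(text):
    """Lowercase the text and return its stream of tokens."""
    return TOKEN_RE.findall(text.lower())

def index_document(doc_id, text, term_ids, doc_index_lines):
    """Assign termids and record each term's positions in this document.

    term_ids: dict mapping term -> termid, grown as new terms appear
    doc_index_lines: list collecting doc_index.txt-style lines
    """
    positions = {}  # termid -> list of positions within this document
    for pos, token in enumerate(tokenize(text), start=1):
        tid = term_ids.setdefault(token, len(term_ids) + 1)
        positions.setdefault(tid, []).append(pos)
    for tid, pos_list in sorted(positions.items()):
        fields = [str(doc_id), str(tid)] + [str(p) for p in pos_list]
        doc_index_lines.append("\t".join(fields))

# Tiny demo on a made-up document.
term_ids = {}
lines = []
index_document(1234, "Asparagus, asparagus everywhere.", term_ids, lines)
print("\n".join(lines))
```

Note that stemming would happen between matching a token and looking up its termid, so that termids map to stemmed terms.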
Extra credit: Implement your program using constant memory with respect to the number of documents and the length of a document. This more realistic requirement would allow you to index any number of documents of any length. It is OK for memory usage to grow with the vocabulary size: Heaps' law tells us that the vocabulary grows without bound, so this would not work for a real Internet-scale crawl, but it also tells us that vocabulary size grows very slowly with respect to the number of documents we process.
The final steps in index construction are inverting the index and preparing for fast random access to terms' inverted lists. Write a program which reads doc_index.txt to produce the following files.
For example, an inverted list for termid 567, as a sequence of docid:position pairs:
567\t1234:9\t1234:13\t1240:3\t1240:7
the same list with delta encoding, where each docid is stored as a gap from the previous docid and each position as a gap from the previous position in the same document:
567\t1234:9\t0:4\t6:3\t0:4
and a line of term_info.txt listing a termid, the byte offset of its inverted list, its total frequency in the corpus, and the number of documents containing it:
567\t1542\t567\t315
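One way to implement the inversion is sketched below, assuming the delta encoding suggested by the examples above: docid gaps between postings, with position gaps restarting whenever the docid changes, and a term-info line holding termid, offset, corpus frequency, and document count. The sketch writes to in-memory buffers for illustration; a real implementation would write files and record byte offsets.

```python
import io
from collections import defaultdict

def invert(doc_index_lines):
    """Group doc_index.txt lines into one posting list per termid."""
    postings = defaultdict(list)  # termid -> [(docid, position), ...]
    for line in doc_index_lines:
        fields = line.split("\t")
        doc_id, term_id = int(fields[0]), int(fields[1])
        for pos in fields[2:]:
            postings[term_id].append((doc_id, int(pos)))
    return postings

def delta_encode(pairs):
    """Encode docids as gaps from the previous posting; positions are
    gaps from the previous position, restarting in each new document."""
    out, prev_doc, prev_pos = [], 0, 0
    for doc_id, pos in sorted(pairs):
        if doc_id != prev_doc:
            prev_pos = 0  # positions restart in a new document
        out.append((doc_id - prev_doc, pos - prev_pos))
        prev_doc, prev_pos = doc_id, pos
    return out

def write_index(postings, index_file, info_file):
    """Write delta-encoded inverted lists, recording each list's offset,
    corpus frequency, and document count in the term-info file."""
    for term_id in sorted(postings):
        pairs = postings[term_id]
        offset = index_file.tell()  # on a real file, a byte offset
        encoded = "\t".join(f"{d}:{p}" for d, p in delta_encode(pairs))
        index_file.write(f"{term_id}\t{encoded}\n")
        n_docs = len({doc for doc, _ in pairs})
        info_file.write(f"{term_id}\t{offset}\t{len(pairs)}\t{n_docs}\n")

# Demo: the postings behind the example lines above.
idx, info = io.StringIO(), io.StringIO()
write_index(invert(["1234\t567\t9\t13", "1240\t567\t3\t7"]), idx, info)
```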
Extra credit: Implement your program using constant memory with respect to the number of documents and term positions.
Extra credit: Implement your program in linear time with respect to the length of doc_index.txt, and using constant memory with respect to the number of documents, term positions, and terms. This stricter requirement is worth more points.
Now that you have an inverted index of the corpus, you'll want to be able to do something with it. This is mostly left for the next project. For now, we will just write the code to pull up some statistics from the index. Write a program which implements the following command line interface. Your program must not scan the inverted index linearly; it must look up the offset in term_info.txt and jump straight to the correct inverted list.
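The offset-based lookup might look like the following sketch. It assumes a term_info.txt line holds a termid, the offset of its inverted list, its corpus frequency, and its document count (consistent with the example values shown for asparagus); the index file name used in the demo is illustrative.

```python
import os
import tempfile

def load_term_info(path):
    """Load term_info.txt: termid -> (offset, corpus freq, doc count)."""
    info = {}
    with open(path) as f:
        for line in f:
            term_id, offset, total, n_docs = (int(x) for x in line.split("\t"))
            info[term_id] = (offset, total, n_docs)
    return info

def read_inverted_list(index_path, offset):
    """Seek straight to a term's inverted list using its stored offset --
    no linear scan of the index file."""
    with open(index_path) as f:
        f.seek(offset)
        return f.readline().rstrip("\n").split("\t")

# Tiny demo: a one-line index plus its term-info entry.
tmp = tempfile.mkdtemp()
index_path = os.path.join(tmp, "term_index.txt")  # illustrative name
info_path = os.path.join(tmp, "term_info.txt")
with open(index_path, "w") as f:
    f.write("567\t1234:9\t0:4\t6:3\t0:4\n")
with open(info_path, "w") as f:
    f.write("567\t0\t4\t2\n")

info = load_term_info(info_path)
offset, total, n_docs = info[567]
fields = read_inverted_list(index_path, offset)
```

Loading term_info.txt into a dictionary once at startup keeps each query to a single seek and read, which is the behavior required here.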
Keep in mind as you design this program that you will be reusing much of this code in the next project.
You can call the program anything you like, and in Java your command will look slightly different. Note that the values in the output examples below are made up.
Passing just --doc DOCNAME will list the following document information:
$ ./read_index.py --doc clueweb12-0000tw-13-04988
Listing for document: clueweb12-0000tw-13-04988
DOCID: 1234
Distinct terms: 25
Total terms: 501
Passing just --term TERM will stem the term and then list the following term information:
$ ./read_index.py --term asparagus
Listing for term: asparagus
TERMID: 567
Number of documents containing term: 315
Term frequency in corpus: 567
Inverted list offset: 1542
Passing both --term TERM and --doc DOCNAME will show the inverted list for the document/term:
$ ./read_index.py --term asparagus --doc clueweb12-0000tw-13-04988
Inverted list for term: asparagus
In document: clueweb12-0000tw-13-04988
TERMID: 567
DOCID: 1234
Term frequency in document: 4
Positions: 134, 155, 201, 233
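A possible argparse skeleton for this interface is shown below. The statistics lookups themselves are omitted, and the returned mode tuples are illustrative; a real implementation would use the offset from term_info.txt to seek directly to the term's inverted list, as described above.

```python
import argparse

def main(argv=None):
    """Dispatch on --doc / --term as in the examples above."""
    parser = argparse.ArgumentParser(prog="read_index.py")
    parser.add_argument("--doc", metavar="DOCNAME",
                        help="document name to look up")
    parser.add_argument("--term", metavar="TERM",
                        help="term to stem and look up")
    args = parser.parse_args(argv)
    if args.term and args.doc:
        # Would print the term's inverted list entries for this document.
        return ("both", args.term, args.doc)
    if args.doc:
        # Would print docid, distinct terms, and total terms.
        return ("doc", args.doc)
    if args.term:
        # Would stem the term, then print its term_info.txt statistics.
        return ("term", args.term)
    parser.error("pass --doc, --term, or both")
```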
We will evaluate your program by running these commands for selected documents and terms.
Remember to submit your files in a folder named pr1. In order to receive the points for any extra credit you attempt, you must say in README.txt which extra credit problems you have attempted.