Assigned: Thursday, 15 November 2012
Due: Monday, 10 December 2012, 6pm
New! Extra credit points available.
In this project, you will replicate the functionality of the Lemur index used in Project 2. In conjunction with the code you wrote implementing various retrieval functions for Project 2, you will then have a fully functioning search engine.
Download the CACM collection from the Search Engines: Information Retrieval in Practice test collection site.
Create an index of the CACM collection, together with code replicating the functionality of the Lemur index used in Project 2.
From the "Search Engines" book site, you should use the .tar.gz version. The .corpus version is in a special format meant for use by the (book's) Galago search engine.
You will note that the CACM "documents" are actually just abstracts of full articles, and, in many cases, not much more than titles. Before disk was cheap, many retrieval systems used no more than this.
Finally, you can ignore the columns of numbers. This is an encoding of bibliographic references. Just index the text.
When creating your index, you should first apply stop-wording using the stop-word list from Project 2.
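Stop-wording amounts to filtering your token stream against the list. A minimal sketch, assuming tokens and the Project 2 stop-word list are already in hand (case-insensitive comparison is an assumption here; match whatever convention your Project 2 code used):

```python
def remove_stopwords(tokens, stopwords):
    """Drop any token that appears in the stop-word list.
    Comparison is case-insensitive (an illustrative choice)."""
    stop = {w.lower() for w in stopwords}
    return [t for t in tokens if t.lower() not in stop]
```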
When creating your index, you should then apply stemming. You may do so using any reasonable stemmer, such as the Porter stemmer or the KStem stemmer, both of which are freely available on the web.
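To show where stemming fits in the pipeline, here is a deliberately crude suffix stripper used purely as a stand-in; it is not the Porter or KStem algorithm, and for the project you would substitute a real implementation:

```python
def crude_stem(token):
    """Crude suffix stripper (stand-in only; NOT Porter or KStem).
    Strips a few common English suffixes when enough stem remains."""
    for suffix in ("ational", "ization", "ing", "ies", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            if suffix == "ies":
                return token[:-3] + "y"   # queries -> query
            return token[:-len(suffix)]
    return token
```

Whichever stemmer you choose, apply it after stop-wording so the stop-word list matches unstemmed tokens, and apply the identical pipeline to queries later.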
Next, you should create an inverted index of the CACM collection documents, as described in class. The index will typically consist of multiple files: (1) a file that maps term names to term IDs and associated term information, such as inverted index offset and length values (see below) and corpus frequency statistics, (2) the inverted index file that maps term IDs to document IDs and associated term frequencies, and (3) a file that maps document IDs to document names and associated document information, such as document lengths.
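The three structures above can be sketched in memory before worrying about the on-disk layout. This sketch takes pre-processed (stop-worded, stemmed) documents as `{doc_name: [tokens]}`; the on-disk offset and length values are omitted, since they only matter once the postings are serialized:

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build three structures from {doc_name: [tokens]}:
      term_info: term -> (term_id, corpus_frequency)
      postings:  term_id -> list of (doc_id, term_frequency)
      doc_info:  doc_id -> (doc_name, doc_length)
    An in-memory sketch of the logical index layout; offsets for an
    on-disk inverted file are omitted."""
    term_ids, cf = {}, Counter()
    postings = defaultdict(list)
    doc_info = {}
    for doc_id, (name, tokens) in enumerate(sorted(docs.items())):
        doc_info[doc_id] = (name, len(tokens))
        for term, tf in Counter(tokens).items():
            tid = term_ids.setdefault(term, len(term_ids))
            cf[term] += tf
            postings[tid].append((doc_id, tf))
    term_info = {t: (tid, cf[t]) for t, tid in term_ids.items()}
    return term_info, dict(postings), doc_info
```

Writing `postings` to a single binary file, recording each term's byte offset and length in `term_info` as you go, yields exactly the layout described above.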
The inverted index file constitutes the bulk of the index. For simplicity, you can build up to this file in stages:
Finally, create code that, given a specified term and the index files above, replicates the functionality of the Lemur index used in Project 2.
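The lookup layer can be a thin wrapper over the index files. A minimal sketch, assuming the in-memory structures named above; the method names here are illustrative, not Lemur's actual API, so map them onto whatever calls your Project 2 code expects:

```python
class SimpleIndex:
    """Minimal lookup layer mimicking the index calls assumed in
    Project 2 (method names are illustrative, not Lemur's API)."""

    def __init__(self, term_info, postings, doc_info):
        self.term_info = term_info  # term -> (term_id, corpus_freq)
        self.postings = postings    # term_id -> [(doc_id, tf), ...]
        self.doc_info = doc_info    # doc_id -> (doc_name, doc_length)

    def doc_freq(self, term):
        """Number of documents containing the term."""
        info = self.term_info.get(term)
        return len(self.postings[info[0]]) if info else 0

    def corpus_freq(self, term):
        """Total occurrences of the term in the collection."""
        info = self.term_info.get(term)
        return info[1] if info else 0

    def inverted_list(self, term):
        """(doc_id, term_frequency) pairs for the term."""
        info = self.term_info.get(term)
        return self.postings[info[0]] if info else []

    def doc_length(self, doc_id):
        return self.doc_info[doc_id][1]
```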
Now, using the index you just built and your code from Project 2, perform retrieval experiments on all CACM queries using all five retrieval algorithms from Project 2. Record and report mean average precision and mean precision at cutoffs of 10 and 30 results, as you did for Project 2. You should, of course, use the queries and qrel file that come with the CACM collection. Note that you should use the "raw" queries, not the "processed" ones: this allows you to stopword and stem your queries in exactly the same way you did the documents when indexing.
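If you did not keep your evaluation code from Project 2, the two measures can be computed as below; mean average precision is then just the mean of `average_precision` over all queries:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Sum of precision at each rank where a relevant document is
    retrieved, divided by the total number of relevant documents
    (per the qrels), so unretrieved relevant docs count against you."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```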
As with Project 2, experiment with various retrieval formula parameters, such as the smoothing parameter, and compare and contrast your results here with those from Project 2: Do the same retrieval formulae work best? Do the optimal parameters change? And so on.
For a maximum of 50 extra points, consider the following:
Many modern search engines end up indexing even stop words. Disk is cheap! But what are the tradeoffs? For extra credit, analyze the empirical time and space complexity of including stopword information in the index:
Note that you should still stem the document and query terms. Points will be assigned for a clear description of the approach and presentation of the results.
The main assignment is worth 150 points. The extra credit portion is worth at most 50 extra points (for a maximum total of 200).