IS4200/CS6200 12S Final Project

Created: Wed 04 Apr 2012
Last modified:

Assigned: Wed 04 Apr 2012
Due: Fri 20 Apr 2012

Indexing

In this project, you will replicate the functionality of the Lemur index used in Project 1. In conjunction with the code you wrote for Project 1 to implement various retrieval functions, this will give you a fully functioning search engine.

The Project

Download the CACM collection, which is available from at least two sources: (1) the Search Engines: Information Retrieval in Practice test collection site and (2) the Glasgow test collection site.
Create an index of the CACM collection, together with code replicating the functionality of the Lemur index used in Project 1.
Notes:
- From the "Search Engines" book site, you should use the .tar.gz version. The .corpus version is in a special format meant for use by the book's Galago search engine. You should also look at the version from the Glasgow site.
  The .tar.gz version from the "Search Engines" book site has the advantage that there is one file per CACM document, but there is some HTML formatting. The Glasgow version has one large file with all the documents concatenated (and no HTML). This may be an advantage as well, depending on how you decide to parse the data.
  You will note that the CACM "documents" are actually just abstracts of full articles, and, in many cases, not much more than titles.
  Finally, you can ignore the columns of numbers. This is an encoding of bibliographic references, I believe. Just index the text.
- When creating your index, you should first apply stop-wording. You may do so using either the stop-word list from Project 1 or the SMART system stop word list available with the CACM collection from the Glasgow site.
- When creating your index, you should then apply stemming. You may do so using any reasonable stemmer, such as the Porter stemmer or the KStem stemmer, each of which is freely available on the web.
- Next, you should create an inverted index of the CACM collection documents, as described in class. The index will typically consist of multiple files: (1) a file that maps term names to term IDs and associated term information, such as inverted index offset and length values (see below) and corpus frequency statistics; (2) the inverted index file, which maps term IDs to document IDs and associated term frequencies; and (3) a file that maps document IDs to document names and associated document information, such as document lengths.
  The inverted index file constitutes the bulk of the index. For simplicity, you can build up to this file in stages:
  1. As you process documents, maintain a separate file per unique term, adding document information to these files as you go.
  2. Concatenate these files into one inverted index file and add the appropriate inverted index offset and length values to the term information file.
- Finally, create code that, given a specified term and the index files above, replicates the functionality of the Lemur index used in Project 1.
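The indexing steps in the notes above can be sketched in Python as follows. This is only a minimal illustration under several assumptions, not a required design: the names tokenize, build_index, and lookup are hypothetical, the stop list is a tiny placeholder, the stem function is a trivial stand-in rather than the Porter or KStem stemmer, and the binary posting layout (two unsigned 32-bit integers per posting) is just one possible choice. Because CACM is small, the sketch accumulates postings in memory instead of in one temporary file per term; the final step of writing one inverted index file and recording each term's offset and length is the same either way.

```python
import re
import struct
from collections import Counter, defaultdict

# Placeholder stop list (assumption); use the Project 1 or SMART list instead.
STOP_WORDS = {"the", "of", "a", "and", "in", "to", "for", "is"}

def stem(token):
    """Placeholder stemmer that strips a trailing 's'; NOT Porter or KStem."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(docs, index_path):
    """docs maps document name -> text.
    Writes the inverted index file and returns (term_info, doc_info)."""
    doc_info = {}                 # doc ID -> (doc name, doc length in terms)
    postings = defaultdict(list)  # term -> [(doc ID, term frequency), ...]
    for doc_id, (name, text) in enumerate(sorted(docs.items()), start=1):
        tokens = tokenize(text)
        doc_info[doc_id] = (name, len(tokens))
        for term, tf in Counter(tokens).items():
            postings[term].append((doc_id, tf))

    # term -> (term ID, byte offset, byte length, corpus freq, doc freq)
    term_info = {}
    with open(index_path, "wb") as out:
        for term_id, term in enumerate(sorted(postings), start=1):
            offset = out.tell()
            for doc_id, tf in postings[term]:
                out.write(struct.pack("<II", doc_id, tf))
            length = out.tell() - offset
            ctf = sum(tf for _, tf in postings[term])
            term_info[term] = (term_id, offset, length, ctf, len(postings[term]))
    return term_info, doc_info

def lookup(term, term_info, index_path):
    """Return [(doc ID, term frequency), ...] for a raw term, or [] if absent."""
    stems = tokenize(term)
    if not stems or stems[0] not in term_info:
        return []
    _, offset, length, _, _ = term_info[stems[0]]
    with open(index_path, "rb") as f:
        f.seek(offset)           # jump straight to this term's postings
        data = f.read(length)
    return list(struct.iter_unpack("<II", data))
```

In this sketch, term_info and doc_info are returned as in-memory dictionaries; in a full index they would be written out as the term information file and the document information file described above. The lookup function applies the same stopping and stemming to the query term that was applied at indexing time, so a raw query word is matched against the stemmed vocabulary.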
Now, using your index and your code from Project 1, perform retrieval experiments on all CACM queries using all five retrieval algorithms from Project 1. Record and report mean average precision and mean precision at cutoffs 10 and 30, as you did for Project 1. You should, of course, use the queries and qrel file that come with the CACM collection.
As with Project 1, experiment with various retrieval formulae parameters, such as the smoothing parameter, and compare and contrast your results here with those from Project 1: Do the same retrieval formulae work best? Do the optimal retrieval formulae parameters change? And so on.
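For reference, the evaluation measures requested above (mean average precision and mean precision at cutoffs 10 and 30) can be computed along the following lines. This is a hedged sketch: the function names and the in-memory data structures (runs and qrels as dictionaries) are hypothetical, and the choice to average only over queries that have at least one relevant document in the qrel file is an assumption that you should check against the evaluation tool you used for Project 1.

```python
def precision_at(ranking, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def average_precision(ranking, relevant):
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def evaluate(runs, qrels, cutoffs=(10, 30)):
    """runs: query ID -> ranked list of doc IDs.
    qrels: query ID -> set of relevant doc IDs.
    Assumption: queries with no relevant documents are skipped."""
    judged = [q for q in qrels if qrels[q]]
    if not judged:
        return {}
    n = len(judged)
    results = {
        "map": sum(average_precision(runs.get(q, []), qrels[q]) for q in judged) / n
    }
    for k in cutoffs:
        results["p@%d" % k] = sum(
            precision_at(runs.get(q, []), qrels[q], k) for q in judged
        ) / n
    return results
```

Running evaluate once per retrieval algorithm and parameter setting yields the numbers to tabulate when comparing against your Project 1 results.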

What to Submit

Create a report describing your system and the results and analyses requested above.
Submit a copy of your code via e-mail.
Arrange for a time to demo your system for the TAs.