ISU535 09X2 Project Phase I

Created: Tue 7 Jul 2009
Last modified: 

Assigned: Tue 7 Jul 2009
Due: No due date (recommended by Mon 13 Jul 2009)


Building an IR system

In this project you will implement a number of different ranking functions, i.e., algorithms that, given a user's request (query) and a corpus of documents, assign a score to each document according to its relevance to the query. Some of these ranking functions will implement the basic retrieval models studied in class (e.g., TF-IDF, BM25, language models with different smoothing methods), while others may be more complicated (e.g., incorporating PageRank, metasearch over different basic ranking functions, relevance feedback, query expansion, learning-to-rank). In any case, we will provide the necessary infrastructure so that you have access to any statistics you need (such as term frequencies, document frequencies, collection term frequencies, etc.) without having to implement the retrieval system from scratch.

The project is split into three phases. During the first phase (Phase I) you will get familiar with the Web interface we provide for accessing all necessary statistics. This means that you will need (1) to understand the different options provided by the interface, (2) to access the interface from the language of your choice, (3) to read the data from the output the interface returns, and (4) to store them in appropriate data structures. During the second phase (Phase II) you will implement a number of basic ranking functions, run a set of queries over the provided document corpus, and evaluate the effectiveness of the different ranking functions. Finally, during the third phase (Phase III) you will enhance the basic retrieval models in a number of different ways.


Phase I

This project requires that you construct a search engine that can run queries on a corpus and return a ranked list of documents. Rather than having you build a system from the ground up, we are providing a Web interface to an existing search engine. That interface gives you access to corpus statistics, document information, inverted lists, and so on. Because all the necessary information is available through the Web interface, you can implement your project in any language that provides Web access and run it on any computer that has access to the Web. Computers in the CCIS department will work just fine, but you are more than welcome to use your own computer.

One of the nice things about the CGI interface is that you can get an idea of what it does using any browser. For example, the following CGI command:

http://fiji4.ccs.neu.edu/~zerg/lemurcgi_IRclass/lemur.cgi?v=encryption


will get you a list of the documents that contain the word "encryption". Go ahead and click on the link to see the output (in a separate window). It lists the number of times the term occurs in the collection (ctf) and the total number of documents that contain it (df). Then, for each document containing the term, it lists the document's ID, its length (number of terms), and the number of times the term occurs in that document.

The format is easy for a human to read and should be easy for you to write a parser for. However, to make it even simpler, the interface provides a "program" version of its interface that strips out all of the easy-to-read stuff and gives you a list of numbers that you have to decipher on your own (mostly it strips the column headings). Here is a variation of the link above that does that. Click on it to see the difference.

http://fiji4.ccs.neu.edu/~zerg/lemurcgi_IRclass/lemur.cgi?g=p&v=encryption
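A parser for the program-mode output of a single term could be sketched as follows. This is only a sketch under an assumption that you should check against the real output: that the response, once any HTML wrapper has been stripped, is a whitespace-separated list of integers consisting of ctf and df followed by one (document ID, document length, term frequency) triple per document. The names Posting, TermStats and parse_term_block are made up for this illustration and are not part of the interface.

```python
from typing import List, NamedTuple


class Posting(NamedTuple):
    doc_id: int   # document ID
    doc_len: int  # document length in terms
    tf: int       # occurrences of the term in this document


class TermStats(NamedTuple):
    ctf: int                 # occurrences of the term in the collection
    df: int                  # number of documents containing the term
    postings: List[Posting]


def parse_term_block(text: str) -> TermStats:
    """Parse the numbers returned for ONE term.

    ASSUMED layout (verify against the real output):
        ctf df
        doc_id doc_len tf
        doc_id doc_len tf
        ...
    """
    numbers = [int(token) for token in text.split()]
    if len(numbers) < 2:
        raise ValueError("expected at least ctf and df")
    ctf, df = numbers[0], numbers[1]
    rest = numbers[2:]
    if len(rest) != 3 * df:
        raise ValueError("expected %d triples, got %d numbers" % (df, len(rest)))
    postings = [Posting(*rest[i:i + 3]) for i in range(0, len(rest), 3)]
    return TermStats(ctf, df, postings)


# Made-up sample data in the assumed layout, not real server output.
sample = "5 2\n101 250 3\n207 400 2\n"
stats = parse_term_block(sample)
```

The check that the number of triples matches df is there to fail early if the real layout differs from the assumed one.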


From class, it should be pretty straightforward to imagine how to create a system that reads in a query, gets the inverted lists for all words in the query, and then comes up with a ranked list. You may need other information, and the interface makes everything accessible (we hope). Hema's wiki page contains a rundown of some of the commands, and there is a help command, too:

http://fiji4.ccs.neu.edu/~zerg/lemurcgi_IRclass/lemur.cgi?h=


Play around with the engine to ensure it makes sense.

Two important tips:

First, fiji4.ccs.neu.edu is not accessible via name from all CCIS internal networks; you may need to use the (local) IP address 10.0.0.176 instead. From outside CCIS, the fiji4.ccs.neu.edu name should always work.
Second, in order to get efficient processing out of this server, you should string as many commands together as you can. For example, if you want to get statistics information for three terms, ask for them all at once as in:

http://fiji4.ccs.neu.edu/~zerg/lemurcgi_IRclass/lemur.cgi?v=encryption&v=star&v=retrieval

This is because there is some overhead in getting the process started, and putting commands together amortizes that overhead over several requests. Issuing the commands separately also works, but it might take longer. (If you use the program mode version of the same command, the Lemur images are dropped.)
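Building such a combined request from a list of terms might look like the sketch below. The v and g=p parameters and the base URL come from the examples above; the function names build_batch_url and fetch are hypothetical helpers, and the sketch makes no assumption about how the server separates the results for different terms in its response.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://fiji4.ccs.neu.edu/~zerg/lemurcgi_IRclass/lemur.cgi"


def build_batch_url(terms: Iterable[str],
                    program_mode: bool = True,
                    base_url: str = BASE_URL) -> str:
    """Return one URL that asks for the statistics of all terms at once."""
    params: List[Tuple[str, str]] = []
    if program_mode:
        params.append(("g", "p"))              # "program" version of the output
    params.extend(("v", term) for term in terms)  # one v=... per term
    return base_url + "?" + urlencode(params)


def fetch(url: str, timeout: float = 30.0) -> str:
    """Download the response body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


url = build_batch_url(["encryption", "star", "retrieval"])
```

If the host name does not resolve from your network, base_url can be pointed at the IP address mentioned in the first tip instead, presumably with the same path.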

This first phase of the project has neither deliverables nor a due date. However, in Phase II you will need to implement a number of different ranking functions and thus calculate things like TF-IDF scores. By the end of this first phase you should therefore be able, given a query, to (1) access the CGI from some programming language, (2) request all the statistics required to compute scores like TF-IDF, (3) parse the returned result to obtain these statistics, and (4) store them in appropriate data structures. You can find a small list of queries you can use to test your code here. These queries also need to be parsed.
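One possible way to hold the statistics in memory is sketched below. The class CorpusStats, its methods and tokenize_query are illustrative names only, and the query tokenization (lowercasing and splitting on non-alphanumeric characters) is an assumption; the actual format of the query file and the tokenization used by the index may call for something different.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class CorpusStats:
    """In-memory store for the statistics a ranking function needs."""

    def __init__(self) -> None:
        self.ctf: Dict[str, int] = {}        # term -> collection term frequency
        self.df: Dict[str, int] = {}         # term -> document frequency
        self.tf: Dict[str, Dict[int, int]] = defaultdict(dict)  # term -> doc -> tf
        self.doc_len: Dict[int, int] = {}    # doc -> length in terms

    def add_term(self, term: str, ctf: int, df: int,
                 postings: Iterable[Tuple[int, int, int]]) -> None:
        """Record one term; postings are (doc_id, doc_len, tf) triples."""
        self.ctf[term] = ctf
        self.df[term] = df
        for doc_id, length, tf in postings:
            self.tf[term][doc_id] = tf
            self.doc_len[doc_id] = length

    def candidate_docs(self, terms: Iterable[str]) -> Set[int]:
        """Documents containing at least one of the given terms."""
        docs: Set[int] = set()
        for term in terms:
            docs.update(self.tf.get(term, {}))
        return docs


def tokenize_query(text: str) -> List[str]:
    """ASSUMED tokenization: lowercase, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Made-up numbers for illustration, not real corpus statistics.
store = CorpusStats()
store.add_term("encryption", 5, 2, [(101, 250, 3), (207, 400, 2)])
store.add_term("star", 4, 2, [(207, 400, 1), (300, 120, 3)])
query_terms = tokenize_query("Data Encryption, star!")
```

Keeping term frequencies keyed by term and then by document makes it cheap to walk one inverted list at a time, while the separate document-length table avoids storing the same length once per term.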


ekanou@ccs.neu.edu