CSG339 Information Retrieval

Project 1: Questions & Answers

Created: Wed 20 Sep 2006
Last modified: 


Q: Why is http://kerf.ccs.neu.edu/ not working? [obsolete]

IMPORTANT: from "community ports" http://kerf.ccs.neu.edu/ is not accessible (we are working on this). Instead, use the NAT address http://10.0.0.204/ . However, this may be a permanent problem; as a result, if you want your code to work both from community ports and from other network outlets (including outside the CCIS network), you should:


If you've never fetched a URL from the language X of your choice (C/C++/Perl, etc.) and you are wondering how you will ever get past the first hurdle of figuring out what libraries exist for X, then I suggest using the wget command on Unix.

If you are a more adventurous Perl user, check out the libwww library for Perl.
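If you would rather stay within one language, most standard libraries can fetch a URL directly. Here is a minimal Python sketch; the endpoint path and parameter names below are hypothetical placeholders, so check the project page for the real CGI interface before using them:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL -- the NAT address from the note above,
# with an assumed CGI path.
BASE_URL = "http://10.0.0.204/cgi-bin/lemur.cgi"

def build_query_url(term, db=0, mode="c"):
    """Build the CGI URL for fetching the inverted list of `term`.
    The parameter names 'term', 'db', and 'mode' are assumptions."""
    params = urlencode({"term": term, "db": db, "mode": mode})
    return BASE_URL + "?" + params

def fetch_inverted_list(term, db=0, mode="c"):
    """Fetch the raw response body for `term` (requires network access)."""
    with urlopen(build_query_url(term, db=db, mode=mode)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Building the URL separately from fetching it makes the request easy to inspect and cache, which matters when you are issuing hundreds of calls.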


Q. Can I call the server manually for each query?

Ans. Although the project is meant for you to automate the handling of HTML source (e.g., wget followed by some parsing of the file), you can do it manually (not recommended): there are 4 systems x 4 databases x 25 queries = 400 HTTP calls, assuming you manage to call the server only once per query. You will have to copy and paste the statistics from the browser, creating 400 files that are then used to rank the documents. If you choose this option, please state clearly "MANUAL HTTP CALLS" at the top of the first page of the report. Note that this will incur a penalty of 10% of the grade (i.e., the maximum obtainable is 90 points).


Q. My understanding of the vector space model is that term-frequency information is needed for every term in the corpus. I'm getting this from slides 8 and 9 of the 4th set of class slides, where the formula needs the length of the document vector, which is the square root of the sum of the squared term frequencies over all terms in the document, including terms not in the query.
I can't find any easy way to get this information from Lemur. For a query term like 'the', if I open and parse every single document to get the tf of each term in order to fill in the vector space, then I calculate that a single query will take nearly 17 days to complete. I hope that's not right.
Is there a way we can reduce the document vector to unit length without having to fill in the entire vector for the document, i.e., without having to know the frequency of each term that appears in the document? Or is there a way to get this from Lemur?

Ans. You could just compute the numerator (the query-document dot product) and rank by that. I'll see if I can provide a normalized inverted list.
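Ranking by the numerator only can be sketched as follows. The inverted-list format (term mapped to a list of (doc_id, tf) pairs) and the log idf weighting are assumptions for illustration, not the course's required formula:

```python
import math
from collections import defaultdict

def rank_by_numerator(query_terms, inverted_lists, num_docs):
    """Score documents by a tf-idf dot product with the query,
    skipping document-length normalization (the numerator only).

    `inverted_lists`: dict mapping term -> list of (doc_id, tf) pairs.
    This format is an assumption -- adapt it to what the CGI returns.
    """
    scores = defaultdict(float)
    for term in query_terms:
        postings = inverted_lists.get(term, [])
        df = len(postings)
        if df == 0:
            continue  # term not in the collection
        idf = math.log(num_docs / df)
        for doc_id, tf in postings:
            scores[doc_id] += tf * idf
    # Highest score first; doc_id breaks ties deterministically.
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))
```

Because every document is divided by nothing here, the ranking is only an approximation of the cosine measure, but it needs just the inverted lists for the query terms.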


Q. I just started to do some basic work on the project. I wrote a couple of Java files that read the inverted lists over the provided CGI interface.
Two things came up:
- The output format in "P" mode is really weird: there's a bunch of HTML code added for no apparent reason. The Content-Type of the document is text/html. I managed to parse the list anyway, but a comma-separated list without any markup would have been easier to use :)
- A bigger problem is Lemur's speed: it is really slow when I read an inverted list that is huge, especially when I pass it a stopword. Testing 50 queries will take hours with this bottleneck.

Ans: I fixed the first problem for the 'c' option; I'll do the same for the others soon. We'll also set up more servers for the CGI, and hopefully you should be fine.


Q. Many of the TREC queries involve 'or' and/or 'and'. Should we explicitly model these as operators in our programs?

Ans: You could implement that, but it's not expected. A simple bag-of-words approach is fine.
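Under a bag-of-words treatment, the query is simply tokenized and its terms counted, with 'and'/'or' passing through as ordinary words (stopword removal may drop them anyway). A minimal sketch:

```python
import re
from collections import Counter

def bag_of_words(query):
    """Tokenize a query into a bag (multiset) of lowercase terms.
    'and'/'or' are kept as ordinary terms, not treated as operators."""
    return Counter(re.findall(r"[a-z0-9]+", query.lower()))
```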


Q. What is the output format for the search engine?

Ans. The standard output is as follows:

[qid] Q0 [DocID] [rank] [score] Exp

qid = query ID
DocID = document ID
rank = rank of this document for this query
score = score from your retrieval algorithm

Q0 and Exp are constants you have to output on each line (don't worry about why this is the case) for the evaluation script to run. The results file should be sorted in order of increasing qid, and for each qid in order of increasing rank (and therefore decreasing score). Here is a sample results file for another corpus and another set of queries.
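A minimal sketch of emitting results in this layout; the six-decimal score formatting is an arbitrary choice, not a requirement stated above:

```python
def format_results(qid, ranked_docs, run_tag="Exp"):
    """Format one query's results as '[qid] Q0 [DocID] [rank] [score] Exp'.

    `ranked_docs`: list of (doc_id, score) pairs, best first.
    Ranks are assigned 1, 2, 3, ... in list order, so increasing rank
    corresponds to decreasing score as the format requires.
    """
    lines = []
    for rank, (doc_id, score) in enumerate(ranked_docs, start=1):
        lines.append(f"{qid} Q0 {doc_id} {rank} {score:.6f} {run_tag}")
    return lines
```

Writing the queries out in increasing qid order then yields a file the evaluation script can consume.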


Documents and squared document lengths for each of the databases.

Database 0: d=0.doclen

Database 1: d=1.doclen

Database 2: d=2.doclen

Database 3: d=3.doclen
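Assuming each .doclen file lists one '<docID> <squared_length>' pair per line (verify this against the actual files before relying on it), the squared lengths can be used to normalize raw dot-product scores without opening any documents:

```python
import math

def load_squared_lengths(text):
    """Parse a .doclen file whose lines are assumed to be
    '<docID> <squared_length>' -- an assumed format, check the files."""
    sq = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            sq[parts[0]] = float(parts[1])
    return sq

def normalize(score, doc_id, sq_lengths):
    """Divide a raw dot-product score by the document vector length
    (the square root of the stored squared length)."""
    return score / math.sqrt(sq_lengths[doc_id])
```

This is exactly the shortcut discussed in the vector-space question above: the denominator comes from the precomputed file rather than from filling in the whole document vector.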