IS4200/CS6200 10X1 Project 01 QA

Created: Thu 20 May 2010
Last modified: 


If you've never called a URL from your language of choice (C/C++/Perl, etc.) and you are wondering how you will ever get past the first hurdle of figuring out what libraries exist for it, I suggest using the wget command on Unix.
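If you would rather stay inside one program than shell out to wget, the Python standard library can fetch a URL directly. A minimal sketch follows; the host name and CGI parameter names here are made up for illustration, so substitute the real ones from the project handout.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical CGI endpoint -- replace with the real server address.
BASE = "http://example.edu/cgi-bin/lemur.cgi"

def build_url(mode, term, db=0):
    """Build the query URL for one inverted-list request.

    The parameter names ("g", "q", "d") are placeholders, not the
    server's actual interface."""
    return BASE + "?" + urlencode({"g": mode, "q": term, "d": db})

def fetch(url):
    """Fetch the URL and return the body as text (like `wget -O -`)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

print(build_url("p", "retrieval"))
```

The same two functions cover both styles of use: call fetch() per query term for a fully automatic run, or just print the URLs and fetch them by hand.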

If you are a more adventurous Perl user, check out the libwww library for Perl.


Q. Can I call the server manually for each query?

Ans. Although the project is meant to be done with automatic handling of the HTML source (e.g., wget followed by some parsing of the file), you can do it manually, saving the results to files that you will later process. Note that you will have to create (at least) 25 such files, one per query.


Q. My understanding of the vector space model is that information is needed on the term frequency of every term in the corpus. I'm getting this from slides 8 and 9 of the 4th set of class slides, where the formula needs the length of the document vector, which is the square root of the sum of the squared term frequencies over all terms in the document, including terms not in the query.
I can't find any easy way to get this information from Lemur. For a query term like 'the', if I open and parse every single document to get the tf of each term in the document to fill in the vector space, then I calculate that a single query will take nearly 17 days to complete. I hope that's not right.
Is there a way we can reduce the document vector to unit length without having to fill in the entire vector for the document, i.e., without having to know the frequency of each term that appears in the document? Or is there a way to get this from Lemur?

Ans. You could just compute the numerator and rank by that, as suggested in the project statement. Also, see below for document lengths and document squared lengths.
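Concretely, ranking by the numerator means that for each query term you walk its postings list and add a term-weight product to each document's score; no full document vectors are needed. Here is a sketch; the postings format (a list of (doc_id, tf) pairs parsed from the CGI output) and the tf-idf weighting are assumptions, not the required scheme.

```python
from collections import defaultdict
from math import log

def rank_by_numerator(query_terms, postings, n_docs):
    """Score documents by the cosine numerator only.

    query_terms: {term: query tf}
    postings:    {term: [(doc_id, tf), ...]}  -- parsed inverted lists
    n_docs:      total number of documents, used for idf
    Returns (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for term, qtf in query_terms.items():
        plist = postings.get(term, [])
        if not plist:
            continue
        idf = log(n_docs / len(plist))  # one common idf choice
        for doc_id, tf in plist:
            scores[doc_id] += (tf * idf) * (qtf * idf)
    return sorted(scores.items(), key=lambda x: -x[1])

ranked = rank_by_numerator({"boat": 1}, {"boat": [("d1", 3), ("d2", 1)]}, 4)
print(ranked)
```

Note that only the terms in the query are ever touched, which is why this avoids the 17-day scenario above.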


Q. I just started to do some basic work on the Project. I just wrote a couple of Java files reading the inverted lists over the provided CGI interface.
Two things came up:
- The output format in the "P" mode is really weird: there is some HTML code added for no apparent reason. The Content-Type of the document is text/html. I managed to parse the list, but a comma-separated list without any markup would have been easier to use.
- A bigger problem is Lemur's speed: it is very slow when I try to read an inverted list that is huge, especially when I pass over a stopword. Testing 25 queries will take hours to complete with this bottleneck.

Ans: I fixed the first problem for the 'c' option; I'll soon do the same for the others. We'll set up more servers for the CGI, and hopefully you should then be OK.
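In the meantime, a quick way to strip the stray markup before parsing is to delete anything that looks like a tag. This is a sketch, not tailored to the server's actual output, so adjust it to whatever tags really appear.

```python
import re

def strip_html(text):
    """Remove HTML tags and collapse runs of whitespace, leaving just
    the inverted-list payload ready for splitting on commas."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(no_tags.split())

print(strip_html("<html><body>12, 3; 47, 1</body></html>"))
```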


Q. Many of the TREC queries involve 'or' and/or 'and'. Should we explicitly model these as operators in our programs?

Ans: You could implement that, but it's not expected. A simple bag-of-words approach is fine.


Q. What is the output format for the search engine?

Ans. The standard output format is as follows:

[qid] Q0 [DocID] [rank] [score] Exp

qid = query ID
DocID = document ID
rank = rank of this document for this query
score = score from your retrieval algorithm

Q0 and Exp are constants that you have to output on each line (don't worry about why this is the case) for the evaluation script to run. The results file should be sorted in order of increasing qid, and for each qid in order of increasing rank, and therefore decreasing score. Here is a sample results file for another corpus and another set of queries.
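Emitting this format is a one-liner per result. A sketch follows; the query id and document IDs here are invented examples, and the four-decimal score formatting is a choice, not a requirement.

```python
def format_results(qid, ranked):
    """Turn (doc_id, score) pairs, best first, into result lines:
    [qid] Q0 [DocID] [rank] [score] Exp"""
    lines = []
    for rank, (doc_id, score) in enumerate(ranked, start=1):
        lines.append(f"{qid} Q0 {doc_id} {rank} {score:.4f} Exp")
    return lines

for line in format_results(51, [("AP890101-0001", 12.5), ("AP890101-0002", 9.25)]):
    print(line)
```

Writing the lines query by query, with queries processed in increasing qid order, gives a file already sorted the way the evaluation script wants.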


Below are files giving document lengths (in words) as well as sum-of-squares "lengths", where the latter is the square of the vector magnitude, useful in computing the true cosine similarity metric.

Database 0: d0.doclen

Database 1: d1.doclen

Database 2: d2.doclen

Database 3: d3.doclen