If you've never called a URL from your language of choice (C/C++, Perl, etc.) and you are wondering how you will ever get past the first hurdle of figuring out what libraries exist for it, I suggest using the wget command on Unix.
If you are a more adventurous Perl user, check out the libwww library for Perl.
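If you would rather fetch the lists from inside one program, here is a minimal sketch in Python using the standard library. The base URL, parameter names, and mode value below are placeholders, not the real server's; substitute the actual CGI URL and parameters from the project handout.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint -- replace with the real CGI URL from the handout.
BASE_URL = "http://example.edu/cgi-bin/lemur"

def build_query_url(term, mode="c"):
    """Build the query URL for one term; 'mode' selects the output format."""
    return BASE_URL + "?" + urlencode({"q": term, "mode": mode})

def fetch(term):
    """Fetch the inverted list for one term and return it as text."""
    with urlopen(build_query_url(term)) as resp:
        return resp.read().decode("utf-8", errors="replace")

print(build_query_url("retrieval"))
```

The same idea works from Java (java.net.URL) or Perl (libwww); only the URL-building step matters, the rest is ordinary file parsing.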
Q. Can I call the server manually for each query?
Ans. Although the project is meant for you to handle the HTML source automatically (e.g., wget followed by some parsing of the file), you can do it manually, saving the results to a file that you will later process. Note that you will have to create (at least) 25 such files, one per query.
Q. My understanding of the vector space model is that information is needed on the term frequency for every term in the corpus. I'm getting this from slides 8 and 9 of the 4th set of class slides, where the formula needs the length of the document vector, which is the square root of the sum of the squared term frequencies over every term in the document, including terms not in the query.
I can't find any easy way to get this information from Lemur. For a query term like 'the', if I open and parse every single document to get the tf of each term in the document to fill in the vector space, then I calculate that a single query will take nearly 17 days to complete. I hope that's not right. Is there a way we can reduce the document vector to unit length without having to fill in the entire vector for the document, i.e., without having to know the frequency of each term that appears in the document? Or is there a way to get this from Lemur?
Ans. You could just compute the numerator and rank by that, as suggested in the project statement. Also, see below for document lengths and document squared lengths.
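As a concrete illustration, here is a small Python sketch of both options: ranking by the numerator alone, and dividing by a precomputed magnitude (the square root of the sum-of-squares length from the files listed below). The index and lengths here are toy values, and the numerator is deliberately simplified (query term weights of 1, no idf), just to show where the precomputed lengths plug in.

```python
import math

# Toy inverted index: term -> {docID: tf}. In the project these counts
# come from the CGI interface, not from a hard-coded dict.
index = {
    "retrieval": {"D1": 3, "D2": 1},
    "model":     {"D1": 1, "D2": 4},
}

# Precomputed sum-of-squares "lengths" (square of the vector magnitude),
# as provided in the .doclen files below (values here are made up).
sq_length = {"D1": 10.0, "D2": 17.0}

def score(query_terms, normalize=True):
    """Score docs by a simplified cosine numerator; optionally divide
    by the document vector magnitude sqrt(sum of squared tfs)."""
    scores = {}
    for term in query_terms:
        for doc, tf in index.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + tf  # numerator term
    if normalize:
        for doc in scores:
            scores[doc] /= math.sqrt(sq_length[doc])
    return sorted(scores.items(), key=lambda x: -x[1])

print(score(["retrieval", "model"]))
```

Note that normalization can change the ranking: here D2 wins on the raw numerator (5 vs. 4) but D1 wins after dividing by the magnitudes, which is exactly why the precomputed lengths are provided.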
Q. I just started some basic work on the project. I wrote a couple of Java files that read the inverted lists over the provided CGI interface.
Two things came up:
- The output format in the "P" mode is really weird: some HTML code is added for no apparent reason, and the Content-Type of the document is text/html. I managed to parse the list anyway, but a comma-separated list without any markup would have been easier to use.
- A bigger problem is Lemur's speed: it is very slow when I read an inverted list that is huge, especially when I pass it a stopword. Testing 25 queries will take hours with this bottleneck.
Ans: I fixed the first problem for the 'c' option; I'll soon do the same for the others. We'll set up more servers for the CGI, and hopefully you will be OK.
Q. Many of the TREC queries involve 'or' and/or 'and'. Should we explicitly model these as operators in our programs?
Ans: You could implement that, but it's not expected. A simple bag-of-words model is fine.
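A minimal sketch of the bag-of-words treatment: 'and' and 'or' are simply dropped (treated like any other stopword) rather than interpreted as operators. The stopword list below is an assumption for illustration; use whatever list your runs call for.

```python
# Assumed minimal stopword list -- extend as needed for your runs.
STOPWORDS = {"and", "or", "the", "of", "a"}

def bag_of_words(query):
    """Lowercase, tokenize on whitespace, strip punctuation, drop
    stopwords; return counts of the remaining query terms."""
    counts = {}
    for tok in query.lower().split():
        tok = tok.strip(".,?!\"'")
        if tok and tok not in STOPWORDS:
            counts[tok] = counts.get(tok, 0) + 1
    return counts

print(bag_of_words("oil and gas or mineral rights"))
```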
Q. What is the output format for the search engine?
Ans. The standard output format is as follows:
[qid] Q0 [DocID] [rank] [score] Exp
qid = query id
DocID = document ID
rank = rank of this document for this query
score = score from your retrieval algorithm
Q0 and Exp are constants that you must output on each line (don't worry about why this is the case) for the evaluation script to run. The results file should be sorted by increasing qid, and within each qid by increasing rank and therefore decreasing score. Here is a sample results file for another corpus and another set of queries.
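As a sketch, producing lines in that format from scored results might look like the following in Python. The triples below are made-up data; the sorting mirrors the requirement above (increasing qid, then decreasing score, with rank restarting at 1 for each qid).

```python
# Made-up scored results: (qid, docID, score).
results = [
    (1, "DOC42", 0.91),
    (2, "DOC99", 0.75),
    (1, "DOC17", 0.88),
]

def format_results(results):
    """Emit '[qid] Q0 [DocID] [rank] [score] Exp' lines, sorted by qid
    and, within each qid, by decreasing score; rank starts at 1."""
    lines = []
    last_qid, rank = None, 0
    for qid, doc, score in sorted(results, key=lambda r: (r[0], -r[2])):
        rank = rank + 1 if qid == last_qid else 1
        last_qid = qid
        lines.append(f"{qid} Q0 {doc} {rank} {score:.4f} Exp")
    return lines

print("\n".join(format_results(results)))
```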
Below are files giving document lengths (in words) as well as the sum-of-squares "lengths", where the latter is the square of the document vector's magnitude, useful for computing the true cosine similarity metric.
Database 0: d0.doclen
Database 1: d1.doclen
Database 2: d2.doclen
Database 3: d3.doclen
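A parsing sketch for these files, under the assumption that each line holds a document ID, its length in words, and its sum-of-squares length, whitespace-separated. That layout is a guess, not a specification; inspect the actual d*.doclen files and adjust the field positions accordingly.

```python
def load_doclen(lines):
    """Parse doclen records from an iterable of text lines, assuming
    'docID length sum_of_squares' per line (layout is an assumption --
    check the actual d*.doclen files). Returns two dicts keyed by docID."""
    lengths, sq_lengths = {}, {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        doc = parts[0]
        lengths[doc] = int(parts[1])
        sq_lengths[doc] = float(parts[2])
    return lengths, sq_lengths
```

In use, you would pass it an open file, e.g. `load_doclen(open("d0.doclen"))`, and keep the returned dicts in memory so the normalization step never has to touch the documents themselves.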