Created: Wed 20 Sep 2006
Last modified:
Q: Why is http://kerf.ccs.neu.edu/ not working? [obsolete]
IMPORTANT: http://kerf.ccs.neu.edu/ is not accessible from "community ports" (we are working on this). Instead, use the NAT address http://10.0.0.204/ . However, this may be a permanent problem; as a result, if you want your code to work both from community ports and from other network outlets (including outside the CCIS network), you should:
- use a variable for the server name, so it can be changed easily;
- detect from code whether your machine is attached to a community port (if you go through NOMAD you are on a community port); alternatively, change the server name manually when you start a new session.
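The two suggestions above can be sketched as follows (in Python, as one possible language; the function and variable names are hypothetical, and the automatic detection is left as a manual flag you set per session):

```python
# Hypothetical sketch: keep the server name in a single variable so it is
# easy to switch between the public hostname and the NAT address.
PUBLIC_SERVER = "http://kerf.ccs.neu.edu/"
NAT_SERVER = "http://10.0.0.204/"

def pick_server(on_community_port: bool) -> str:
    """Return the base URL to use for this session.

    Set on_community_port manually when you start a session (or replace
    the flag with your own detection logic, e.g. checking for NOMAD).
    """
    return NAT_SERVER if on_community_port else PUBLIC_SERVER

server = pick_server(on_community_port=True)
```

With this, switching environments means changing one flag rather than editing URLs scattered through your code.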
If you have never called a URL from the language X of your choice (C/C++/Perl, etc.) and you are wondering how you will ever get past the first hurdle of figuring out what libraries exist for X, then I suggest using the wget command on Unix.
If you are a more adventurous Perl user, check out the libwww library for Perl.
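If your language X happens to be Python, the standard library alone is enough to fetch a page; a minimal sketch (the query path shown in the comment is a placeholder, not the real CGI URL):

```python
# Minimal sketch of fetching a URL over HTTP in Python.
from urllib.request import urlopen

def fetch(url: str) -> str:
    """Download a URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# Example (hypothetical path -- substitute the real query URL):
# html = fetch("http://10.0.0.204/cgi-bin/...")
```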
Q. Can I call the server manually for each query?
Ans. Although the project is meant to be done with automatic handling of the HTML source (e.g. wget followed by some parsing of the file), you can do it manually (not recommended): there are 4 systems x 4 databases x 25 queries = 400 HTTP calls, assuming you manage to call the server only once per query. You will have to copy-paste the statistics from the browser, creating 400 files used afterwards to rank the documents. If you choose this option, please state clearly "MANUAL HTTP CALLS" at the top of the first page of the report. Note that this will result in a penalty of 10% of the grade (i.e. the maximum obtainable is 90 points).
Q. My understanding of the vector space model is that information is needed on the term frequency of every term in the corpus. I am getting this from slides 8 and 9 of the 4th set of class slides, where the formula needs the length of the document vector, which is the square root of the sum of the squared term frequencies over all terms in the document, including terms not in the query.
I can't find any easy way to get this information from Lemur. For a query term like 'the', if I open and parse every single document to get the tf of each term in the document to fill in the vector space, then I calculate that a single query will take nearly 17 days to complete. I hope that's not right. Is there a way we can reduce the document vector to unit length without having to fill in the entire vector for the document, i.e. without having to know the frequency of each term that appears in the document? Or is there a way to get this from Lemur?
Ans. You could just compute the numerator (the dot product of the query and document term weights) and rank by that. I'll see if I can provide a normalized inverted list.
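A minimal sketch of that numerator-only ranking (the function and parameter names are hypothetical; it assumes you already have per-document term frequencies and document frequencies for the query terms from the inverted lists):

```python
import math

def score(query_terms, doc_tf, df, num_docs):
    """Rank by the cosine numerator only: the sum of tf*idf weights over
    the query terms. The document-length normalization is skipped, so no
    information about non-query terms is needed.

    doc_tf: term -> frequency in this document
    df:     term -> number of documents containing the term
    """
    s = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df or df[term] == 0:
            continue
        idf = math.log(num_docs / df[term])
        s += tf * idf
    return s
```

Because only query terms contribute to the sum, each query needs just the inverted lists of its own terms rather than a pass over every document.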
Q. I just started to do some basic work on the project. I wrote a couple of Java files that read the inverted lists over the provided CGI interface. Two things came up:
- The output format in the "P" mode is really weird: some HTML code is added without any reason. The Content-Type of the document is text/html. I managed to parse the list anyway, but a comma-separated list without any markup would have been easier to use. :-)
- A bigger problem is Lemur's speed: it is really slow if I try to read an inverted list that is huge, especially when I pass it a stopword. Testing 50 queries will take hours to complete with this bottleneck.
Ans: I fixed the first problem for the 'c' option; I'll soon do the same for the others. We'll set up more servers for the CGI, and hopefully you will be fine.
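In the meantime, one workaround for the speed problem (a sketch, not a project requirement; the cache directory name is arbitrary) is to cache each fetched inverted list on disk, so a huge list or a stopword's list is downloaded only once across your test runs:

```python
import hashlib
import os
from urllib.request import urlopen

CACHE_DIR = "cache"  # arbitrary local directory for cached responses

def fetch_cached(url: str) -> str:
    """Fetch a URL, storing the body on disk keyed by a hash of the URL.
    Subsequent calls with the same URL read from disk instead of the server."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    with urlopen(url) as response:
        body = response.read().decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as f:
        f.write(body)
    return body
```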
Q. Many of the TREC queries involve 'or' and/or 'and'. Should we explicitly model these as operators in our programs?
Ans: You could implement that, but it's not expected. A simple bag-of-words model is fine.
Q. What is the output format for the search engine?
Ans. The standard output is as follows:
[qid] Q0 [DocID] [rank] [score] Exp
qid = query ID
DocID = document ID
rank = rank of this document for this query
score = score from your retrieval algorithm
Q0 and Exp are constants that you have to output on each line (don't worry about why this is the case) for the evaluation script to run. The results file should be sorted in order of increasing qid, and for each qid in order of increasing rank and therefore decreasing score. Here is a sample results file for another corpus and another set of queries.
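A short sketch of emitting this format (function and parameter names are hypothetical; it assumes each per-query result list is already sorted by decreasing score, so rank follows list position):

```python
def write_results(results, out_path):
    """Write TREC-style result lines: [qid] Q0 [DocID] [rank] [score] Exp

    results: dict mapping qid -> list of (doc_id, score) pairs,
             each list sorted by decreasing score.
    Queries are written in increasing qid order, ranks starting at 1.
    """
    with open(out_path, "w") as out:
        for qid in sorted(results):
            for rank, (doc_id, score) in enumerate(results[qid], start=1):
                out.write(f"{qid} Q0 {doc_id} {rank} {score:.4f} Exp\n")
```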
Documents and squared document lengths for each of the databases:
Database 0: d=0.doclen
Database 1: d=1.doclen
Database 2: d=2.doclen
Database 3: d=3.doclen
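If you want to use these files for length normalization, a parsing sketch follows. The line format is an assumption (one "docID squared-length" pair per whitespace-separated line); adjust it to whatever the actual .doclen files contain:

```python
import math

def load_doc_lengths(path):
    """Parse a .doclen file, assuming (hypothetically) one
    'docID squared_length' pair per line. Since the files store SQUARED
    lengths, the square root is taken here; returns docID -> length."""
    lengths = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            doc_id, squared_length = parts[0], float(parts[1])
            lengths[doc_id] = math.sqrt(squared_length)
    return lengths
```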