Important: Submit by committing your files to svn. See the checklist below to make sure you turn everything in.
In this project, you will use the index you created in Project One to rank documents and create a search engine. You will implement several different scoring functions and compare their results against a baseline ranking produced by expert analysts. You will then write a short report to discuss your results.
Write a 2-5 page report providing the following information.
For this assignment, you will need the following two files (your CCIS password is required): topics.xml, which contains the queries you will run, and corpus.qrel, which contains relevance assessments for those queries. The queries themselves are found in the <query> elements of topics.xml. The <description> elements are only there to clarify the information need which each query is trying to express.
Note that even with this clarification, the information needs are still somewhat ambiguous. For instance, what does "biggest dog" mean for query 230? The tallest? The heaviest? Do fictional dogs count? The ambiguities are part of what makes information retrieval difficult.
The format here is:

<topic> 0 <docid> <grade>

<topic> is the ID of the query for which the document was assessed.
0 is part of the format and can be ignored.
<docid> is the name of one of the documents which you have indexed.
<grade> is a value in the set {-2, 0, 1, 2, 3, 4}, where a higher value means that the document is more relevant to the query. The value -2 indicates a spam document, and 0 indicates a non-spam document which is completely non-relevant. Most queries do not have any document with a grade of 4, and many queries do not have any document with a grade of 3. This is a consequence of the specific meaning assigned to these grades here and the manner in which the documents were collected.

This QREL does not have assessments for every (query, document) pair. If an assessment is missing, we assume the correct grade for the pair is 0 (non-relevant).
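For reference, here is a minimal sketch of parsing the QREL in Python; the file name corpus.qrel is from this assignment, but the dictionary layout is just one reasonable choice, not part of the spec:

from collections import defaultdict

def load_qrel(path):
    # Parse QREL lines of the form: <topic> 0 <docid> <grade>
    # into a nested dict: {topic: {docid: grade}}.
    grades = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic, _zero, docid, grade = line.split()
            grades[topic][docid] = int(grade)
    return grades

qrel = load_qrel("corpus.qrel")
# A missing assessment defaults to grade 0 (non-relevant):
grade = qrel["202"].get("clueweb12-0000tw-13-04988", 0)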
You will write a program which takes the name of a scoring function as a command line argument and which prints to STDOUT a ranked list of documents for all queries found in topics.xml using that scoring function. For example:
$ ./query.py --score TF-IDF
202 0 clueweb12-0000tw-13-04988 1 0.73 run1
202 0 clueweb12-0000tw-13-04901 2 0.33 run1
202 0 clueweb12-0000tw-13-04932 3 0.32 run1
...
214 0 clueweb12-0000tw-13-05088 1 0.73 run1
214 0 clueweb12-0000tw-13-05001 2 0.33 run1
214 0 clueweb12-0000tw-13-05032 3 0.32 run1
...
250 0 clueweb12-0000tw-13-05032 500 0.002 run1
The possible values for the --score parameter are defined below. The output should have one row for each document which your program ranks for each query it runs. These lines should have the format:
<topic> 0 <docid> <rank> <score> <run>
<topic> is the ID of the query for which the document was ranked.
0 is part of the format and can be ignored.
<docid> is the document identifier.
<rank> is the order in which to present the document to the user. The document with the highest score will be assigned a rank of 1, the second highest a rank of 2, and so on.
<score> is the actual score the document obtained for that query.
<run> is the name of the run. You can use any value here. It is meant to allow research teams to submit multiple runs for evaluation in competitions such as TREC.

Before running any scoring function, you should process the text of the query in exactly the same way that you processed the text of a document; that is, apply the same tokenization and any stopword removal or stemming you used when indexing.
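Putting the output format together, here is a minimal sketch of printing one topic's ranked list; the scores dict, the run name, and the 500-document cutoff (taken from the example output above) are assumptions about your surrounding code:

def print_ranking(topic, scores, run="run1", limit=500):
    # scores: hypothetical dict mapping docid -> score for this topic.
    # Emit lines in the format: <topic> 0 <docid> <rank> <score> <run>
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for rank, (docid, score) in enumerate(ranked[:limit], start=1):
        print(f"{topic} 0 {docid} {rank} {score} {run}")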
To evaluate your results, we will use a modified form of average precision called Graded Average Precision, or GAP. (GAP comes out of our own IR lab; the paper introducing the metric is here, if you're curious.) You should download the script here and use it for evaluation. In order to evaluate a run, redirect the output of your program to a file (which I'll call run.txt) and run gap.py on it:
$ ./query.py --score TF-IDF > run.txt
$ ./gap.py corpus.qrel run.txt -v
The --score parameter should take one of the following values, indicating how to assign a score to a document for a query.
The parameter --score TF directs your program to use a vector space model with term frequency scores. We will use a slightly modified form of the basic term frequency score known as Okapi TF, or Robertson's TF. In a vector space model, a query or a document is (conceptually, not literally) represented as a vector with a term score for each term in the vocabulary. Using term frequency scores, the component of the document vector for term \(i\) is:
$$ d_i = oktf(d, i) $$
$$ oktf(d, i) = \frac{tf(d, i)}{tf(d, i) + 0.5 + 1.5 \cdot (len(d) / avg(len(d)))} $$
where \(tf(d, i)\) is the number of occurrences of term \(i\) in document (or query) \(d\), \(len(d)\) is the number of terms in document \(d\), and \(avg(len(d))\) is the average document length taken over all documents. You should assign a ranking score to documents by calculating their cosine similarity to the query's vector representation. That is,
$$ score(d) = \frac{ \vec{d} \cdot \vec{q} }{ \|\vec{d}\| \cdot \|\vec{q}\| } $$
where \(\|\vec{x}\| = \sqrt{\sum_i x_i^2} \) is the norm of vector \(\vec{x}\).
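Here is a minimal sketch of Okapi TF with cosine similarity, assuming you already have term-frequency dictionaries for the document and the (already processed) query; the names tf_d, tf_q, doc_len, query_len, and avg_len are hypothetical:

import math

def oktf(tf, length, avg_len):
    # Okapi TF: tf / (tf + 0.5 + 1.5 * (len / avg_len))
    return tf / (tf + 0.5 + 1.5 * (length / avg_len))

def tf_score(tf_d, tf_q, doc_len, query_len, avg_len):
    # Terms with tf = 0 contribute nothing to either the dot product or the
    # norms, so these sparse dicts stand in for the full vocabulary vectors.
    d = {t: oktf(f, doc_len, avg_len) for t, f in tf_d.items()}
    q = {t: oktf(f, query_len, avg_len) for t, f in tf_q.items()}
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0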
The parameter --score TF-IDF directs your program to use a vector space model with TF-IDF scores. This should be very similar to the TF score, but use the following scoring function:
$$ d_i = oktf(d, i) \cdot \log{ \frac{D}{df(i)} } $$
where \(D\) is the total number of documents, and \(df(i)\) is the number of documents which contain term \(i\).
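The change from the TF sketch above is a single multiplier on each vector component; a sketch, assuming \(D\) and \(df(i)\) come from your Project One index:

import math

def tfidf_weight(tf, length, avg_len, D, df):
    # Okapi TF (as in the TF sketch) scaled by IDF = log(D / df(i)).
    oktf = tf / (tf + 0.5 + 1.5 * (length / avg_len))
    return oktf * math.log(D / df)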
The parameter --score BM25 directs your program to use BM25 scores. This should use the following scoring function for document \(d\) and query \(q\):
$$ score(d, q) = \sum_{i \in q} \left[ \log \left( \frac{ D + 0.5 }{ df(i) + 0.5 } \right) \cdot \frac{(1 + k_1) \cdot tf(d, i)}{ K + tf(d, i) } \cdot \frac{(1 + k_2) \cdot tf(q, i)}{ k_2 + tf(q, i) } \right] $$
$$ K = k_1 \cdot \left( (1 - b) + b \cdot \frac{len(d)}{avg(len(d))} \right) $$
where \(k_1, k_2,\) and \(b\) are constants. For starters, you can use the values suggested in the lecture on BM25. Feel free to experiment with different values for these constants to learn their effect and try to improve performance.
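A sketch of the BM25 sum, assuming per-document and per-query term-frequency dicts; the defaults below (\(k_1 = 1.2\), \(b = 0.75\), \(k_2 = 100\)) are common choices from the literature, not values prescribed by this assignment:

import math

def bm25_score(tf_d, tf_q, doc_len, avg_len, D, df, k1=1.2, k2=100.0, b=0.75):
    # df: dict mapping term -> document frequency; D: total number of documents.
    K = k1 * ((1 - b) + b * (doc_len / avg_len))
    score = 0.0
    for term, qtf in tf_q.items():
        dtf = tf_d.get(term, 0)
        if dtf == 0:
            continue  # the document-TF factor is zero, so this term adds nothing
        idf = math.log((D + 0.5) / (df[term] + 0.5))
        score += idf * ((1 + k1) * dtf / (K + dtf)) * ((1 + k2) * qtf / (k2 + qtf))
    return score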
The parameter --score Laplace directs your program to use a language model with Laplace smoothing (a.k.a. "add-one" smoothing). To be more specific, we will use maximum likelihood estimates of the query based on a multinomial model "trained" on the document. In this setup, you can consider just the terms which actually appear in the query. Here is the scoring function.
$$ score(d, q) = \sum_{i \in q} \log p_d(i) $$
$$ p_d(i) = \frac{tf(d, i) + 1}{len(d) + V} $$
where \(V\) is the vocabulary size: the number of unique terms in the corpus. This gives an estimate of the likelihood that the language model which generated the document could also generate the query. The \(+1\) and \(+V\) smooth the probability out so that we don't assign probability zero to terms which did not appear in the document.
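A sketch of the Laplace-smoothed score, assuming query_terms is the processed token list for the query (so a repeated term contributes once per occurrence) and vocab_size is \(V\):

import math

def laplace_score(tf_d, query_terms, doc_len, vocab_size):
    # Sum of log p_d(i) over the query terms, with add-one smoothing,
    # so unseen terms get a small but nonzero probability.
    return sum(math.log((tf_d.get(t, 0) + 1) / (doc_len + vocab_size))
               for t in query_terms)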
The parameter --score JM directs your program to use a language model with Jelinek-Mercer smoothing. This uses the same scoring function and the same language model, but a slightly different smoothing function. In this case we use
$$ p_d(i) = \lambda \frac{tf(d, i)}{len(d)} + (1-\lambda) \frac{\sum_d tf(d, i)}{\sum_d len(d)} $$
The parameter \(\lambda \in [0, 1] \) allows us to mix the probability from a model trained on document \(d\) with the probability from a model trained on all documents. You can use \(\lambda=0.2\). You are encouraged, but not required, to play with the value of \(\lambda\) to see its effect. (What do you expect would happen with \(\lambda=0\) or \(\lambda=1\)?)
Note that you can calculate the second term using "the total number of occurrences of the term in the entire corpus" (the cumulative term frequency, or ctf) which you have stored in term_info.txt.
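A sketch of Jelinek-Mercer scoring, assuming ctf maps each term to its corpus-wide frequency (from term_info.txt) and total_len is the summed length of all documents:

import math

def jm_score(tf_d, query_terms, doc_len, ctf, total_len, lam=0.2):
    score = 0.0
    for t in query_terms:
        # Mix the document model with the corpus (background) model.
        p = lam * tf_d.get(t, 0) / doc_len + (1 - lam) * ctf.get(t, 0) / total_len
        # A query term absent from the entire corpus would give p = 0;
        # the floor below is an arbitrary guard, not part of the model.
        score += math.log(p) if p > 0 else math.log(1e-12)
    return score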
Submit your files in a folder named pr2.