CS6200 Project Two

Assigned: Wednesday, March 12, 2014
Due: 10:00pm, Tuesday, April 1, 2014

Important: Submit by committing your files to svn. See the checklist below to make sure you turn everything in.

Corrections

  • The BM25 formula has been updated: you should only take the log of the first term. This term should never be zero, so your program should never take log(0).
  • The BM25 formula has been updated: the third term should have \(k_2\) in the denominator rather than \(K\).
  • The Jelinek-Mercer formula has been updated: the second term should have separate summations in the numerator and the denominator, and my original note was correct.
  • Note: it is irrelevant whether you multiply the first term by \(\lambda\) and the second by \(1-\lambda\) or vice versa. All this changes is whether to set \(\lambda\) to (for instance) 0.2 or 0.8.

Overview

In this project, you will use the index you created in Project One to rank documents and create a search engine. You will implement several different scoring functions and compare their results against a baseline ranking produced by expert analysts. You will then write a short report to discuss your results.

Report

Write a 2-5 page report providing the following information.

  1. Before you write any code, read this entire assignment and think about the queries and the different scoring functions. Write a paragraph describing how you expect each scoring function will perform and why. It's OK if you're not sure, and it's OK if you don't get it right. This portion of the report will only be graded based on whether it's there or not and whether it has some meaningful content. The idea is just to start thinking about what's unique about each method and how that might affect their results.
  2. Describe how each method actually performed on the queries. Was there a method which was significantly better than the others? Was there a query which was significantly harder than the others? Explain why you got the results you did, and how that compares to what you expected to happen before implementing the rankers. Include a table of the GAP score for each scoring function on each query, and the average GAP score of each scoring function across all queries.
  3. (Extra credit) Formulate alternative versions of each query which produce better GAP scores for at least two scoring functions. Submit your alternative topics.xml file, include a table of the improved GAP scores in your report, and explain why you think these queries produce better scores. If they don't improve the GAP scores for all the scoring functions, try to explain why. In order to qualify, your replacement queries must be no more than ten words long. Extra credit will be assigned on a query-by-query basis, but your analysis need not explain all ten queries in detail.

Running Queries

For this assignment, you will need the following two files (your CCIS password is required):

  • topics.xml contains the queries you will be testing. You should run the queries using the text stored in the <query> elements. The <description> elements are only there to clarify the information need which the query is trying to express.

    Note that even with this clarification, the information needs are still somewhat ambiguous. For instance, what does "biggest dog" mean for query 230? The tallest? The heaviest? Do fictional dogs count? The ambiguities are part of what makes information retrieval difficult.

  • corpus.qrel contains the relevance grades from expert assessors. While these grades are not necessarily entirely correct (and defining correctness unambiguously is quite difficult), they are fairly reliable and we will treat them as being correct here.

    The format here is:

    <topic> 0 <docid> <grade>

    • <topic> is the ID of the query for which the document was assessed.
    • 0 is part of the format and can be ignored.
    • <docid> is the name of one of the documents which you have indexed.
    • <grade> is a value in the set {-2, 0, 1, 2, 3, 4}, where a higher value means that the document is more relevant to the query. The value -2 indicates a spam document, and 0 indicates a non-spam document which is completely non-relevant. Most queries do not have any document with a grade of 4, and many queries do not have any document with a grade of 3. This is a consequence of the specific meaning assigned to these grades here and the manner in which the documents were collected.

    This QREL does not have assessments for every (query, document) pair. If an assessment is missing, we assume the correct grade for the pair is 0 (non-relevant).
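
For reference, here is a minimal sketch of how these two input files might be loaded in Python. The structure assumed for topics.xml (a <topic> element carrying a number attribute with <query> and <description> children) is a guess, not something specified above, so adjust the element and attribute names to match the actual file.

import xml.etree.ElementTree as ET
from collections import defaultdict

def load_topics(path="topics.xml"):
    """Return {topic_id: query_text}, ignoring the <description> elements."""
    topics = {}
    for topic in ET.parse(path).getroot().iter("topic"):
        # "number" is an assumed attribute name for the topic ID.
        topics[topic.get("number")] = topic.findtext("query").strip()
    return topics

def load_qrel(path="corpus.qrel"):
    """Return {topic_id: {docid: grade}}; missing pairs default to grade 0."""
    grades = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic, _zero, docid, grade = line.split()
            grades[topic][docid] = int(grade)
    return grades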

You will write a program which takes the name of a scoring function as a command line argument and which prints to STDOUT a ranked list of documents for all queries found in topics.xml using that scoring function. For example:

$ ./query.py --score TF-IDF
202 0 clueweb12-0000tw-13-04988 1 0.73 run1
202 0 clueweb12-0000tw-13-04901 2 0.33 run1
202 0 clueweb12-0000tw-13-04932 3 0.32 run1
...
214 0 clueweb12-0000tw-13-05088 1 0.73 run1
214 0 clueweb12-0000tw-13-05001 2 0.33 run1
214 0 clueweb12-0000tw-13-05032 3 0.32 run1
...
250 0 clueweb12-0000tw-13-05032 500 0.002 run1

The possible values for the --score parameter are defined below. The output should have one row for each document which your program ranks for each query it runs. These lines should have the format:

<topic> 0 <docid> <rank> <score> <run>

  • <topic> is the ID of the query for which the document was ranked.
  • 0 is part of the format and can be ignored.
  • <docid> is the document identifier.
  • <rank> is the order in which to present the document to the user. The document with the highest score will be assigned a rank of 1, the second highest a rank of 2, and so on.
  • <score> is the actual score the document obtained for that query.
  • <run> is the name of the run. You can use any value here. It is meant to allow research teams to submit multiple runs for evaluation in competitions such as TREC.
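
As a rough sketch, once a scoring function has produced a {docid: score} mapping for a query, the output lines can be printed like this. The cutoff of 1000 documents per query and the run name "run1" are arbitrary choices for illustration, not requirements of the format.

def print_ranked_list(topic_id, scores, run_name="run1", depth=1000):
    """Print one query's results in the six-column format described above.

    scores is a {docid: score} dict produced by one of the scoring functions."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    for rank, (docid, score) in enumerate(ranked[:depth], start=1):
        print(topic_id, 0, docid, rank, score, run_name)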

Query Processing

Before running any scoring function, you should process the text of the query in exactly the same way that you processed the text of a document. That is:

  1. Split the query into tokens (it is most correct to use the same tokenizing regular expression as your indexer, but for these queries it suffices to split on whitespace)
  2. Convert all tokens to lowercase
  3. Apply stop-wording to the query using the same list you used in project 1
  4. Apply the same stemming algorithm to the query which you used in your indexer
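
A minimal sketch of this pipeline follows; stopwords and stem stand in for whatever stop list and stemmer you used in Project One, so they are placeholders rather than a prescribed API.

def process_query(text, stopwords, stem):
    """Tokenize, lowercase, stop-word, and stem a query string."""
    tokens = [t.lower() for t in text.split()]           # steps 1 and 2
    tokens = [t for t in tokens if t not in stopwords]   # step 3
    return [stem(t) for t in tokens]                     # step 4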

Evaluation

To evaluate your results, we will use a modified form of average precision called Graded Average Precision, or GAP. (It comes out of our own IR lab; the paper introducing the metric is here, if you're curious.) You should download the script here and use it for evaluation. In order to evaluate a run, redirect the output of your program to a file (which I'll call run.txt) and run gap.py on it:

$ ./query.py --score TF-IDF > run.txt
$ ./gap.py corpus.qrel run.txt -v

The program will output the individual GAP score for each query in run.txt, and also the mean GAP score for all queries.

Scoring Functions

The --score parameter should take one of the following values, indicating how to assign a score to a document for a query.

Scoring Function 1: Okapi TF

The parameter --score TF directs your program to use a vector space model with term frequency scores. We will use a slightly modified form of the basic term frequency scores known as Okapi TF, or Robertson's TF. In a vector space model, a query or a document is (conceptually, not literally) represented as a vector with a term score for each term in the vocabulary. Using term frequency scores, the component of the document vector for term \(i\) is: $$ d_i = oktf(d, i) $$ $$ oktf(d, i) = \frac{tf(d, i)}{tf(d, i) + 0.5 + 1.5 \cdot (len(d) / avg(len(d)))} $$ where \(tf(d, i)\) is the number of occurrences of term \(i\) in document (or query) \(d\), \(len(d)\) is the number of terms in document \(d\), and \(avg(len(d))\) is the average document length taken over all documents. You should assign a ranking score to documents by calculating their cosine similarity to the query's vector representation. That is, $$ score(d) = \frac{ \vec{d} \cdot \vec{q} }{ \|\vec{d}\| \cdot \|\vec{q}\| } $$ where \(\|\vec{x}\| = \sqrt{\sum_i x_i^2} \) is the norm of vector \(\vec{x}\).
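
A sketch of this scorer is below. It assumes your index can give you a {term: tf} map for each document (doc_tfs) along with document lengths and the average document length; treating the query as a short document and normalizing its length against the average document length is one reasonable interpretation, not something the assignment pins down.

import math
from collections import Counter

def okapi_tf(tf, length, avg_doc_len):
    """oktf(d, i) from the formula above."""
    return tf / (tf + 0.5 + 1.5 * (length / avg_doc_len))

def score_okapi_tf(query_terms, doc_tfs, doc_len, avg_doc_len):
    """Cosine similarity between the Okapi-TF vectors of a query and one document.

    query_terms -- processed query tokens
    doc_tfs     -- {term: tf} for the document, taken from your index
    """
    q_vec = {t: okapi_tf(tf, len(query_terms), avg_doc_len)
             for t, tf in Counter(query_terms).items()}
    d_vec = {t: okapi_tf(tf, doc_len, avg_doc_len) for t, tf in doc_tfs.items()}

    dot = sum(w * d_vec[t] for t, w in q_vec.items() if t in d_vec)
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0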

Scoring Function 2: TF-IDF

The parameter --score TF-IDF directs your program to use a vector space model with TF-IDF scores. This should be very similar to the TF score, but use the following scoring function: $$ d_i = oktf(d, i) \cdot \log{ \frac{D}{df(i)} } $$ where \(D\) is the total number of documents, and \(df(i)\) is the number of documents which contain term \(i\).
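
The only change from the Okapi TF sketch is the vector component; the cosine-similarity machinery stays the same. Something along these lines, where df is the document frequency from your index and num_docs is the total number of documents:

import math

def tf_idf(tf, length, avg_doc_len, df, num_docs):
    """Okapi TF weight (as in the previous sketch) scaled by log inverse document frequency."""
    oktf = tf / (tf + 0.5 + 1.5 * (length / avg_doc_len))
    return oktf * math.log(num_docs / df)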

Scoring Function 3: Okapi BM25

The parameter --score BM25 directs your program to use BM25 scores. This should use the following scoring function for document \(d\) and query \(q\): $$ score(d, q) = \sum_{i \in q} \left[ \log \left( \frac{ D + 0.5 }{ df(i) + 0.5 } \right) \cdot \frac{(1 + k_1) \cdot tf(d, i)}{ K + tf(d, i) } \cdot \frac{(1 + k_2) \cdot tf(q, i)}{ k_2 + tf(q, i) } \right] $$ $$ K = k_1 \cdot \left( (1 - b) + b \cdot \frac{len(d)}{avg(len(d))} \right) $$ Where \(k_1, k_2, \) and \(b\) are constants. For starters, you can use the values suggested in the lecture on BM25. Feel free to experiment with different values for these constants to learn their effect and try to improve performance.
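
A sketch of the summation is below. The constants k1 = 1.2, k2 = 100, and b = 0.75 are common defaults used here only as placeholders for whatever values the lecture suggests, and doc_freqs is assumed to be a {term: df} mapping from your index.

import math
from collections import Counter

def score_bm25(query_terms, doc_tfs, doc_len, avg_doc_len,
               doc_freqs, num_docs, k1=1.2, k2=100.0, b=0.75):
    """BM25 score of one document for one query."""
    K = k1 * ((1 - b) + b * (doc_len / avg_doc_len))
    score = 0.0
    for term, qtf in Counter(query_terms).items():
        tf = doc_tfs.get(term, 0)
        if tf == 0:
            continue                      # a zero tf contributes nothing to the sum
        idf = math.log((num_docs + 0.5) / (doc_freqs[term] + 0.5))
        score += idf * ((1 + k1) * tf) / (K + tf) * ((1 + k2) * qtf) / (k2 + qtf)
    return score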

Scoring Function 4: Language model with Laplace Smoothing

The parameter --score Laplace directs your program to use a language model with Laplace smoothing (a.k.a. "add-one" smoothing). To be more specific, we will use maximum likelihood estimates of the query based on a multinomial model "trained" on the document. In this setup, you can consider just the terms which actually appear in the query. Here is the scoring function. $$ score(d, q) = \sum_{i \in q} \log p_d(i) $$ $$ p_d(i) = \frac{tf(d, i) + 1}{len(d) + V} $$ where \(V\) is the vocabulary size: the number of unique terms in the corpus. This gives an estimate of the likelihood that the language model which generated the document could also generate the query. The \(+1\) and \(+V\) smooth the probability out so that we don't assign probability zero to terms which did not appear in the document.
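
A sketch, assuming vocab_size is the number of unique terms in your index and reading the sum over \(i \in q\) as counting repeated query terms each time they occur:

import math

def score_laplace(query_terms, doc_tfs, doc_len, vocab_size):
    """Laplace-smoothed log-likelihood of the query under the document's model."""
    score = 0.0
    for term in query_terms:              # repeated query terms count each time
        p = (doc_tfs.get(term, 0) + 1) / (doc_len + vocab_size)
        score += math.log(p)
    return score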

Scoring Function 5: Language model with Jelinek-Mercer Smoothing

The parameter --score JM directs your program to use a language model with Jelinek-Mercer smoothing. This uses the same scoring function and the same language model, but a slightly different smoothing function. In this case we use $$ p_d(i) = \lambda \frac{tf(d, i)}{len(d)} + (1-\lambda) \frac{\sum_d tf(d, i)}{\sum_d len(d)} $$ The parameter \(\lambda \in [0, 1] \) allows us to mix the probability from a model trained on document \(d\) with the probability from a model trained on all documents. You can use \(\lambda=0.2\). You are encouraged, but not required, to play with the value of \(\lambda\) to see its effect. (What do you expect would happen with \(\lambda=0\) or \(\lambda=1\)?)

Note that you can calculate the numerator of the second term using "the total number of occurrences of the term in the entire corpus" (the cumulative term frequency, or ctf), which you have stored in term_info.txt; the denominator is the total number of tokens in the corpus, i.e. the sum of all document lengths.
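
A sketch along the same lines as the Laplace scorer; here ctf is assumed to be a {term: cumulative term frequency} mapping built from term_info.txt and corpus_len the sum of all document lengths. The guard against p == 0 (for a term that never appears anywhere in the corpus) is one way to avoid log(0), not something the assignment specifies.

import math

def score_jm(query_terms, doc_tfs, doc_len, ctf, corpus_len, lam=0.2):
    """Jelinek-Mercer smoothed log-likelihood of the query under the document's model."""
    score = 0.0
    for term in query_terms:
        p_doc = doc_tfs.get(term, 0) / doc_len if doc_len else 0.0
        p_corpus = ctf.get(term, 0) / corpus_len
        p = lam * p_doc + (1 - lam) * p_corpus
        if p > 0:                         # skip terms that never occur in the corpus
            score += math.log(p)
    return score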

Submission Checklist

Submit your files in a folder named pr2.

Rubric