CS6200: Information Retrieval

Homework 4

Return to basic course information.

Assigned: Thursday, 20 November 2014
Due: Tuesday, 2 December 2014, 11:59 p.m.


Instructions

  1. If you collaborated with others, you must write down with whom you worked on the assignment. If this changes from problem to problem, then you should write down this information separately with each problem.
  2. Submit the requested written answers, code, and instructions via subversion.

Problems

  1. [10 points] BM25
    1. The BM25 retrieval model modifies the binary independence model by including term weighting. Explain what ‘binary independence’ means and what the ranking function for the binary independence model would be.
    2. The BM25 document term weighting component includes three parameters: $k_1$, $b$, and $K$, which is a function of the other two. (The $k_2$ parameter is used for query term weighting.) Explain in words the impact of varying the values of these parameters, and submit a graph of the term-weighting component of BM25 as a function of term frequency for $k_1 = 1$ and $K = 1$.
  2. [70 points] A Small Search Engine

    Implement a small search engine. The main steps are:

    Tokenized Document Collection

    Building an Inverted Index: The following data structures are required for BM25 computation:

    BM25 Ranking

    1. Retrieve all inverted lists corresponding to terms in a query.
    2. Compute BM25 scores for documents in the lists.
    3. Make a score list for documents in the inverted lists.
    4. Accumulate scores for each term in a query on the score list.
    5. Assume that no relevance information is available.
    6. For parameters, use $k_1 = 1.2$, $b = 0.75$, $k_2 = 100$.
    7. Sort the documents by the BM25 scores.

    Test Queries: Use the following stemmed test queries, also provided in the file queries.txt:
    Query IDQuery Text
    1portabl oper system
    2code optim for space effici
    3parallel algorithm
    4distribut comput structur and algorithm
    5appli stochast process
    6perform evalu and model of comput system
    7parallel processor in inform retriev