Due: Tuesday, 10 December 2019, 11:59 p.m.
For the course project in CS6200, you will form teams of two to four people. The division of labor in the group will be described below.
If you are in IS4200, please contact me as soon as possible to discuss a project.
Although information retrieval systems perform well on many tasks, there is still plenty of room for new systems and applications. Your job will be to take the first steps in defining and evaluating a new task. The purpose of this project is thus to emphasize the centrality of problem definition and evaluation to information retrieval.
What kinds of tasks might you consider? In class, we have mostly discussed ad-hoc retrieval: the ranking of documents in response to user queries previously unseen by the search engine. For this project, you may consider either ad-hoc retrieval with particular types of queries that current search engines do not seem to handle well, or search tasks that involve other output modalities, such as document summaries, structured data, or clusters. It would be preferable if there were more than one correct answer for a given query. In other words, put more of your emphasis on information retrieval than on, e.g., question answering or summarization.
Here are some examples of tasks you might investigate:
The core of this project will be creating an evaluation set for the proposed task. Procedurally, creating the evaluation corpus will proceed as follows:
The evaluation set will therefore consist of a number of records, one for each query. These records are conventionally called topics in IR evaluation. Each topic contains:
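To make the record structure concrete, here is a minimal sketch of one topic, assuming each topic pairs a query with its candidate results and graded human relevance judgments; the field names and grading scale are illustrative, not prescribed by the assignment.

```python
# A hypothetical topic record. All field names and the 0-2 grading
# scale are assumptions for illustration only.
topic = {
    "id": "T001",
    "query": "example information need",
    "candidates": [
        {"doc_id": "D1", "text": "first candidate result"},
        {"doc_id": "D2", "text": "second candidate result"},
    ],
    # doc_id -> relevance grade (e.g., 0 = not relevant, 2 = highly relevant)
    "judgments": {"D1": 2, "D2": 0},
}
```

Whatever concrete format you choose (JSON, TSV, etc.), keeping queries, candidates, and judgments together in one record per topic makes the later reranking evaluation straightforward.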
On the basis of these relevance judgments, you should be able to estimate human performance on your chosen task. You should also evaluate a baseline model, which will give you an idea of how much progress remains to be made on the task. These baseline models do not need to be complex. You should evaluate baseline models using the evaluation set alone, without indexing and searching an entire collection. Instead, evaluate the baseline model on a reranking task: apply the model to the query and each candidate result in the evaluation set in turn, then rank the candidate results by the model's score and evaluate this ranking by comparison to the human judgments.
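The reranking evaluation described above can be sketched as follows. This is a hedged illustration, not a required implementation: the term-overlap baseline, the topic field names, and the use of average precision with binary judgments are all assumptions you should replace with your own model, format, and metric.

```python
# Sketch of reranking-style evaluation over one topic record,
# assuming binary relevance (grade > 0 means relevant).

def overlap_score(query, candidate_text):
    """Toy baseline: number of query terms appearing in the candidate."""
    q_terms = set(query.lower().split())
    c_terms = set(candidate_text.lower().split())
    return len(q_terms & c_terms)

def average_precision(ranked_ids, relevant_ids):
    """Average precision of a ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def evaluate_topic(topic, score_fn):
    """Rerank a topic's candidates with score_fn and return AP."""
    ranked = sorted(topic["candidates"],
                    key=lambda c: score_fn(topic["query"], c["text"]),
                    reverse=True)
    relevant = {d for d, grade in topic["judgments"].items() if grade > 0}
    return average_precision([c["doc_id"] for c in ranked], relevant)

# Illustrative topic (hypothetical data):
topic = {
    "query": "penguin habitats",
    "candidates": [
        {"doc_id": "D1", "text": "a page about car engines"},
        {"doc_id": "D2", "text": "penguin habitats in Antarctica"},
    ],
    "judgments": {"D1": 0, "D2": 1},
}
print(evaluate_topic(topic, overlap_score))  # -> 1.0 (relevant doc ranked first)
```

Averaging the per-topic scores over all topics in your evaluation set gives a single summary number (here, mean average precision) that you can report for the baseline alongside your estimate of human performance.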
In your final submission, please include:
Implement a more specialized model to solve your task. Evaluate its performance compared to the baseline and to human performance.