Assigned: Thursday, 20 November 2014
Due: Tuesday, 2 December 2014, 11:59 p.m.
Implement a small search engine. The main steps are:
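1. Obtain the corpus file tccorpus.txt, which is inside tccorpus.zip. This is an early standard collection of abstracts from the Communications of the ACM.
2. Build an inverted index from the corpus:
    indexer tccorpus.txt index.out
3. Rank documents for the test queries with BM25 and write the results to a file:
    bm25 index.out queries.txt 100 > results.eval
4. Write each result line in the format:
    query_id Q0 doc_id rank BM25_score system_name
   The string Q0 is a literal used by the evaluation script. You can use any space-free token for your system_name.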
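A minimal sketch of how the result lines might be produced, assuming scores for one query are already held in a dictionary mapping doc_id to BM25 score. The function name `format_results`, the `limit` parameter (taken here to correspond to the 100 on the bm25 command line), ranks starting at 1, and the placeholder system name `my_bm25` are all assumptions for illustration, not part of the assignment text.

```python
def format_results(query_id, scores, limit=100, system_name="my_bm25"):
    """Return result lines 'query_id Q0 doc_id rank BM25_score system_name'.

    Assumptions (not stated in the assignment): ranks start at 1, documents
    are sorted by descending score, and ties are broken by doc_id.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    lines = []
    for rank, (doc_id, score) in enumerate(ranked[:limit], start=1):
        lines.append(f"{query_id} Q0 {doc_id} {rank} {score} {system_name}")
    return lines


if __name__ == "__main__":
    # Hypothetical scores for query 1, only to show the line layout.
    for line in format_results(1, {"d1": 0.5, "d2": 1.5}):
        print(line)
```

Keeping the formatting in one small function makes it easy to check that the system name contains no spaces before the file is handed to the evaluation script.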
Tokenized Document Collection: The tccorpus.txt file is in the format:

    # 1
    this is a tokenized line for document 1
    this is also a line of document 1
    # 2
    from here lines for document 2 begin
    ...
    ...
    # 3
    ...
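The format above could be read with a short parser along these lines. This is a sketch under the assumption that a document header is a line consisting of exactly `#` followed by one document ID, and that every other non-empty line holds whitespace-separated tokens for the most recent header. The name `parse_corpus` is hypothetical.

```python
def parse_corpus(lines):
    """Map each document ID to the list of tokens that follow its header.

    Assumption: a header line is exactly two fields, '#' and the doc ID.
    Token lines that appear before any header are ignored.
    """
    docs = {}
    current = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "#" and len(fields) == 2:
            current = fields[1]
            docs.setdefault(current, [])
        elif current is not None:
            docs[current].extend(fields)
    return docs


if __name__ == "__main__":
    sample = [
        "# 1",
        "this is a tokenized line for document 1",
        "this is also a line of document 1",
        "# 2",
        "from here lines for document 2 begin",
    ]
    for doc_id, tokens in parse_corpus(sample).items():
        print(doc_id, len(tokens))
```

To read the real file, an open file object can be passed directly, for example `parse_corpus(open("tccorpus.txt"))`, since the function only iterates over lines.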
Building an Inverted Index: The following data structures are required for BM25 computation:
word -> (docid, tf), (docid, tf), ...
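One way to build the word -> (docid, tf) mapping is sketched below, assuming the documents are already available as a dictionary from doc ID to token list. The function name `build_index` is hypothetical. The sketch also records the number of tokens in each document, on the assumption that the BM25 variant used will need document lengths for its length normalization.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Return (index, doc_lengths) for a dict of doc_id -> token list.

    index maps each word to a list of (doc_id, tf) pairs.
    doc_lengths maps each doc_id to its token count (an added assumption:
    BM25 length normalization typically needs it).
    """
    index = defaultdict(list)
    doc_lengths = {}
    for doc_id, tokens in docs.items():
        doc_lengths[doc_id] = len(tokens)
        for word, tf in Counter(tokens).items():
            index[word].append((doc_id, tf))
    return dict(index), doc_lengths


if __name__ == "__main__":
    # Tiny made-up collection, only to show the structure of the output.
    docs = {"1": ["parallel", "algorithm", "parallel"], "2": ["algorithm"]}
    index, doc_lengths = build_index(docs)
    print(index["parallel"])
    print(index["algorithm"])
    print(doc_lengths)
```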
BM25 Ranking
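As an illustration only, here is a sketch of one common form of BM25 scoring over an index shaped like word -> (docid, tf). The exact variant and parameter values are assumptions, since this page does not state them: the sketch uses k1 = 1.2 and b = 0.75 and the non-negative IDF form log(1 + (N - n + 0.5) / (n + 0.5)), and it handles repeated query terms simply by adding their contribution once per occurrence. The course may prescribe a different formula or constants, in which case those take precedence. The name `bm25_scores` is hypothetical.

```python
import math


def bm25_scores(query_terms, index, doc_lengths, k1=1.2, b=0.75):
    """Score documents for one query with a common BM25 variant.

    index: word -> list of (doc_id, tf); doc_lengths: doc_id -> token count.
    k1, b, and the IDF form are assumed defaults, not taken from the course.
    """
    n_docs = len(doc_lengths)
    if n_docs == 0:
        return {}
    avg_len = sum(doc_lengths.values()) / n_docs
    scores = {}
    for term in query_terms:
        postings = index.get(term, [])
        if not postings:
            continue  # term does not occur in the collection
        n_t = len(postings)  # number of documents containing the term
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        for doc_id, tf in postings:
            # Length normalization: longer documents are penalized.
            norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            gain = idf * tf * (k1 + 1) / (tf + norm)
            scores[doc_id] = scores.get(doc_id, 0.0) + gain
    return scores


if __name__ == "__main__":
    # Made-up index and lengths, only to exercise the function.
    index = {"parallel": [("1", 2)], "algorithm": [("1", 1), ("2", 1)]}
    doc_lengths = {"1": 3, "2": 1}
    print(bm25_scores(["parallel", "algorithm"], index, doc_lengths))
```

Only documents that contain at least one query term receive a score, so the result can be sorted and truncated directly when writing the ranked list.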
Test Queries: Use the following stemmed test queries, also provided in the file queries.txt:
Query ID | Query Text |
---|---|
1 | portabl oper system |
2 | code optim for space effici |
3 | parallel algorithm |
4 | distribut comput structur and algorithm |
5 | appli stochast process |
6 | perform evalu and model of comput system |
7 | parallel processor in inform retriev |