Assigned: Wed 04 Apr 2012
Due: Fri 20 Apr 2012
Notes:
The .tar.gz version from the "Search Engines" book site has the advantage that there is one file per CACM document, but the files contain some HTML formatting. The Glasgow version is one large file with all the documents concatenated (and no HTML). That can be an advantage too, depending on how you decide to parse the data.
You will note that the CACM "documents" are actually just abstracts of full articles, and, in many cases, not much more than titles.
Finally, you can ignore the columns of numbers; I believe they are an encoding of bibliographic references. Index only the text.
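As a rough illustration of "just index the text", here is a minimal parsing sketch for the concatenated version. It assumes the file uses SMART-style field markers on their own lines (`.I <id>` to start a document, `.T` for the title, `.W` for the abstract, and `.X` for the numeric columns); check these against the actual file before relying on them. The function name `parse_cacm` and the sample records are made up for this example.

```python
from typing import Dict, Iterable, List

# Assumed field markers; only title (.T) and abstract (.W) are kept.
TEXT_FIELDS = {".T", ".W"}
ALL_FIELDS = {".T", ".W", ".B", ".A", ".N", ".X", ".K", ".C"}


def parse_cacm(lines: Iterable[str]) -> Dict[int, str]:
    """Return {doc_id: text}, skipping every non-text field (including .X)."""
    docs: Dict[int, List[str]] = {}
    doc_id = None
    field = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(".I "):
            # New document: remember its ID and reset the current field.
            doc_id = int(line.split()[1])
            docs[doc_id] = []
            field = None
        elif line in ALL_FIELDS:
            field = line
        elif doc_id is not None and field in TEXT_FIELDS and line:
            docs[doc_id].append(line)
    return {d: " ".join(parts) for d, parts in docs.items()}


# Made-up sample in the assumed layout (not real CACM records).
SAMPLE = """\
.I 1
.T
A Note on Tape Sorting
.B
CACM January, 1960
.A
Doe, J.
.X
1 5 1
2 4 1
.I 2
.T
Compiler Design
.W
A short abstract about
compiler design.
.N
CA600101 JB
"""

if __name__ == "__main__":
    for doc_id, text in parse_cacm(SAMPLE.splitlines()).items():
        print(doc_id, text)
```

The one-file-per-document version would need a different front end (stripping the HTML), but could feed the same `{doc_id: text}` mapping to the indexer.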
The inverted index file constitutes the bulk of the index. For simplicity, you can build up to this file in stages:
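As one possible staging only (a sketch under my own assumptions, not necessarily the stages this assignment has in mind), you could first build postings in memory and then flatten them into a single byte stream with a term lexicon of offsets. All names here (`tokenize`, `build_postings`, `serialize`, `lookup`) and the `doc:tf` record layout are hypothetical.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, int]  # (doc_id, term frequency)


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumerics (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_postings(docs: Dict[int, str]) -> Dict[str, List[Posting]]:
    """Stage 1: in-memory map from term to a doc-ordered postings list."""
    postings: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(tokenize(docs[doc_id])).items():
            postings[term].append((doc_id, tf))
    return dict(postings)


def serialize(postings: Dict[str, List[Posting]]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Stage 2: one record per term, plus a lexicon of (offset, length) per term."""
    chunks: List[bytes] = []
    lexicon: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for term in sorted(postings):
        record = " ".join(f"{d}:{tf}" for d, tf in postings[term]).encode("ascii") + b"\n"
        lexicon[term] = (offset, len(record))
        chunks.append(record)
        offset += len(record)
    return b"".join(chunks), lexicon


def lookup(index_bytes: bytes, lexicon: Dict[str, Tuple[int, int]], term: str) -> List[Posting]:
    """Read a single term's postings back using only its offset and length."""
    if term not in lexicon:
        return []
    offset, length = lexicon[term]
    items = index_bytes[offset:offset + length].decode("ascii").split()
    return [(int(d), int(tf)) for d, tf in (item.split(":") for item in items)]


if __name__ == "__main__":
    toy_docs = {1: "sorting on tape", 2: "tape tape merge"}
    index_bytes, lexicon = serialize(build_postings(toy_docs))
    print(lookup(index_bytes, lexicon, "tape"))
```

On disk, `index_bytes` would be written to the inverted index file and `lookup` would seek to the offset instead of slicing, so only the lexicon needs to stay in memory.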
As with Project 1, experiment with the parameters of the various retrieval formulae, such as the smoothing parameter, and compare and contrast your results with those from Project 1: Do the same retrieval formulae work best? Do the optimal parameter values change? And so on.
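To make the parameter experiment concrete, here is a small sketch of sweeping a smoothing parameter. It assumes, purely for illustration, a query-likelihood model with Dirichlet smoothing; the formulae you actually used in Project 1 may differ, and the names (`dirichlet_score`, `mu`) and toy documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def dirichlet_score(query: List[str], doc: List[str],
                    coll_tf: Dict[str, int], coll_len: int, mu: float) -> float:
    """Log query likelihood: sum of log((tf + mu * P(t|C)) / (|d| + mu))."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # term never occurs in the collection: skipped in this sketch
        score += math.log((tf[term] + mu * p_coll) / (len(doc) + mu))
    return score


if __name__ == "__main__":
    # Toy, already-tokenized documents (not CACM data).
    docs = {
        1: ["tape", "sorting", "tape"],
        2: ["compiler", "design", "notes"],
    }
    coll_tf = Counter(t for d in docs.values() for t in d)
    coll_len = sum(coll_tf.values())
    query = ["tape"]

    for mu in (1.0, 100.0, 2000.0):
        ranked = sorted(
            docs,
            key=lambda d: dirichlet_score(query, docs[d], coll_tf, coll_len, mu),
            reverse=True,
        )
        print(mu, ranked)
```

In a real run you would replace the toy data with scores computed from your index and record an evaluation measure for each parameter value, so that the two projects can be compared on the same footing.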