CS6200/IS4200: Information Retrieval

Homework 1

Assigned: Wednesday, September 11
Due: Tuesday, September 24, 11:59 p.m.

Focused Crawling

Implement your own web crawler, with the following specifications:

Hand in your code, together with a README file that explains how to (compile and) run it. In addition, hand in two lists of URLs, each with at most 1000 entries:

  1. the pages crawled when the crawler is run with no keyphrase, in other words all Wikipedia pages meeting the requirements above to a depth of 5 from the starting seed; and
  2. the pages crawled when the keyphrase is ‘retrieval’.
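
As a rough illustration of how the two lists relate, here is a minimal Python sketch of a breadth-first crawl with a depth limit and a page cap. It is only a sketch under stated assumptions, not the required design: the names crawl, matches, get_links, toy_links and TOY_GRAPH are hypothetical, the seed is assumed to sit at depth 1, and the keyphrase is assumed to be matched against a link's anchor text or URL. A real crawler would replace get_links with code that downloads and parses each page.

```python
from collections import deque


def matches(keyphrase, anchor_text, url):
    """Assumed focus test: case-insensitive match in anchor text or URL."""
    key = keyphrase.lower()
    return key in anchor_text.lower() or key in url.lower()


def crawl(seed, get_links, keyphrase=None, max_depth=5, max_pages=1000):
    """Breadth-first crawl from seed.

    get_links(url) is a hypothetical callable returning (anchor_text, url)
    pairs for the links on that page. The seed is assumed to be at depth 1.
    """
    crawled = []                   # pages in the order they were visited
    seen = {seed}                  # URLs already queued, to avoid repeats
    frontier = deque([(seed, 1)])  # FIFO queue gives breadth-first order
    while frontier and len(crawled) < max_pages:
        url, depth = frontier.popleft()
        crawled.append(url)
        if depth == max_depth:
            continue               # pages at the depth limit are not expanded
        for anchor_text, link in get_links(url):
            if link in seen:
                continue
            if keyphrase and not matches(keyphrase, anchor_text, link):
                continue           # focused crawl: skip non-matching links
            seen.add(link)
            frontier.append((link, depth + 1))
    return crawled


# Toy in-memory link graph standing in for real pages (hypothetical data).
TOY_GRAPH = {
    "A": [("Information retrieval", "B"), ("Cats", "C")],
    "B": [("Document retrieval", "D"), ("Cats", "C")],
    "C": [("Dogs", "E")],
    "D": [],
    "E": [],
}


def toy_links(url):
    return TOY_GRAPH.get(url, [])


full = crawl("A", toy_links)                          # no keyphrase
focused = crawl("A", toy_links, keyphrase="retrieval")
```

On the toy graph, the unfocused crawl visits all five pages, while the focused crawl only follows links whose anchor text or URL mentions the keyphrase.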

Finally, include in your README the percentage of the pages in the full crawl that were retrieved by the focused crawler for ‘retrieval’. Keep in mind that this will be a significant overestimate of the prevalence of Wikipedia articles on information retrieval.
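
One way to compute this figure is sketched below in Python. It assumes the percentage means the share of distinct URLs in the full crawl that also appear in the focused crawl. The function name focused_percentage and the sample URLs are hypothetical.

```python
def focused_percentage(full_crawl, focused_crawl):
    """Percentage of distinct full-crawl URLs also found by the focused crawl."""
    full = set(full_crawl)
    if not full:
        raise ValueError("the full crawl list is empty")
    overlap = full & set(focused_crawl)  # URLs present in both lists
    return 100.0 * len(overlap) / len(full)


# Hypothetical example: 2 of the 4 full-crawl URLs were also retrieved.
pct = focused_percentage(["a", "b", "c", "d"], ["a", "c", "x"])
print(f"{pct:.1f}%")
```

URLs that appear only in the focused list (such as "x" above) do not affect the result under this interpretation.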