CS6200: Information Retrieval

Homework 1

Assigned: Monday, September 15
Due: Email TAs with subject "CS6200 HW1" by Friday, September 26, 11:59 p.m.


Instructions

  1. If you collaborated with others, you must list the people you worked with on the assignment. If your collaborators differ from problem to problem, list them separately for each problem.
  2. Submit to the TAs the requested written answers, your code, and instructions on how to (compile and) run it.

Problems

  1. [5 points] Document filtering is an application that stores a large number of queries or user profiles and compares every incoming document on a feed against these profiles. Documents that are sufficiently similar to a profile are forwarded to the corresponding user via email or some other mechanism.

    Explain the major differences between document filtering and a search engine. Consider issues such as specific efficiency problems and the usefulness of ranking in a filtering application.
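    To make the filtering setup concrete, here is a minimal sketch of comparing one incoming document against every stored profile. The bag-of-words vectors, the cosine similarity measure, the 0.3 threshold, and the names `term_vector`, `cosine`, and `filter_document` are all illustrative assumptions; the problem statement does not prescribe any of them.

```python
import math
from collections import Counter

def term_vector(text):
    # Bag-of-words counts; lowercased whitespace tokenization is an assumption.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def filter_document(doc_text, profiles, threshold=0.3):
    # Compare one incoming document to every stored profile and return the
    # users whose profile is "sufficiently similar" (threshold is assumed).
    doc_vec = term_vector(doc_text)
    return [user for user, vec in profiles.items()
            if cosine(doc_vec, vec) >= threshold]

# Hypothetical stored profiles.
profiles = {
    "alice": term_vector("web crawler search engine"),
    "bob": term_vector("protein folding biology"),
}
recipients = filter_document("a focused web crawler for a search engine", profiles)
print(recipients)
```

    Note that the loop runs over every profile for each document, and that the result is a yes/no forwarding decision per user rather than a ranked list. Both points are worth keeping in mind when answering the question.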

  2. [25 points] Focused Crawling

    Implement your own web crawler, with the following properties:

    Hand in your code and instructions on how to (compile and) run it. In addition, hand in two lists of URLs:

    1. the pages crawled when the crawler is run with no keyphrase, in other words all Wikipedia pages meeting the requirements above to a depth of 3 from the starting seed; and
    2. the pages crawled when the keyphrase is "information retrieval".
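    The two runs above (with and without a keyphrase) can be sketched as one depth-limited breadth-first crawl. This is only an outline under stated assumptions, not the required design: the seed is counted as depth 1, a page is kept and its links followed only if its text contains the keyphrase, and `get_page` is a hypothetical fetcher that returns a page's text and outgoing links. A toy in-memory link graph stands in for real HTTP requests.

```python
from collections import deque

def crawl(seed, get_page, max_depth=3, keyphrase=None):
    """Breadth-first crawl from seed; returns URLs in the order they were crawled.

    get_page(url) -> (text, links) is a hypothetical fetcher supplied by the caller.
    Assumptions: the seed is depth 1; with a keyphrase, pages that do not
    contain it are neither recorded nor expanded.
    """
    seen = {seed}
    queue = deque([(seed, 1)])
    crawled = []
    while queue:
        url, depth = queue.popleft()
        text, links = get_page(url)
        if keyphrase and keyphrase.lower() not in text.lower():
            continue  # focused crawl: skip pages that lack the keyphrase
        crawled.append(url)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return crawled

# Toy "web" (made-up pages) so the sketch runs without network access.
PAGES = {
    "A": ("Information retrieval overview", ["B", "C"]),
    "B": ("Cooking recipes", ["D"]),
    "C": ("Ranking in information retrieval", ["D"]),
    "D": ("Information retrieval evaluation", ["E"]),
    "E": ("Information retrieval history", []),
}
all_pages = crawl("A", PAGES.__getitem__)
focused = crawl("A", PAGES.__getitem__, keyphrase="information retrieval")
print(all_pages, focused)
```

    A real crawler would also need URL filtering and polite request pacing, which this sketch leaves out.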

    What proportion of the total pages were retrieved by the focused crawler for "information retrieval"? Keep in mind that this will be a significant overestimate of the prevalence of Wikipedia articles on information retrieval.
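    The proportion asked for here is the number of distinct URLs in the focused list divided by the number of distinct URLs in the unfocused list. A minimal sketch, assuming each list is available as a Python list of URL strings (the function name and the example URLs are made up for illustration):

```python
def focused_proportion(all_urls, focused_urls):
    # Fraction of the unfocused crawl that the focused crawl also retrieved.
    # Duplicates are removed first so repeated URLs are not double-counted.
    total = set(all_urls)
    if not total:
        raise ValueError("the unfocused crawl list is empty")
    return len(set(focused_urls)) / len(total)

# Made-up example lists, not real crawl output.
example_all = ["u1", "u2", "u3", "u4"]
example_focused = ["u2"]
proportion = focused_proportion(example_all, example_focused)
print(proportion)
```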