IS4200/CS6200: Information Retrieval

Homework 1


Assigned: Thursday, 13 September 2012
Due: Thursday, 27 September 2012


  1. This assignment is due at the beginning of class on the due date assigned above.
  2. If you collaborated with others, you must list everyone you worked with on the assignment. If your collaborators differ from problem to problem, list them separately for each problem.
  3. Submit the requested written answers, your code, and instructions for the TAs on how to (compile and) run the code.


  1. Document filtering is an application that stores a large number of queries or user profiles and compares these profiles to every incoming document on a feed. Documents that are sufficiently similar to a profile are forwarded to the owner of that profile via email or some other mechanism.
    1. Describe the components of a filtering engine using a block diagram of the architecture, a flowchart of the filtering process, and text explaining the function of the components. Use the same level of detail that we gave in the second lecture. For instance, don't just say that the filter needs "text acquisition"; say that it needs "format conversion" and "stemming", to name just two examples.
    2. Explain the major differences compared to a search engine. Consider issues such as specific efficiency problems and the usefulness of ranking in a filtering application.
  2. Use the GNU wget utility to crawl the CIS college site, starting with the seed
  3. Implement your own crawler
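
To make the filtering setup described in problem 1 concrete, here is a minimal sketch of the core matching step: each incoming document is compared against every stored profile and forwarded to the owners of the profiles it matches well enough. This is an illustration under stated assumptions, not a model answer. The names `tokenize`, `cosine`, and `filter_document`, the use of cosine similarity over raw term counts, and the threshold of 0.3 are all hypothetical choices that the assignment does not specify. A real filtering engine would also need the text acquisition and transformation components the problem asks you to describe.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Assumption: no stemming or stopword removal, to keep the sketch short.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_document(doc_text, profiles, threshold=0.3):
    """Return the ids of all profiles whose similarity to the document
    meets the (hypothetical) threshold.

    Unlike a search engine, nothing is ranked here: each profile gets
    a yes/no decision for each incoming document.
    """
    doc_vec = Counter(tokenize(doc_text))
    return [user for user, vec in profiles.items()
            if cosine(doc_vec, vec) >= threshold]


# Hypothetical stored profiles, one term-count vector per user.
profiles = {
    "alice": Counter(tokenize("web crawler politeness robots")),
    "bob": Counter(tokenize("stemming stopping tokenization")),
}

incoming = "A polite web crawler obeys robots exclusion rules"
matches = filter_document(incoming, profiles)
print(matches)
```

Note that this loop touches every profile for every document, which is exactly the kind of efficiency issue problem 1.2 asks you to think about.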