Assigned: Monday, September 15
Due: Email TAs with subject "CS6200 HW1" by Friday, September 26, 11:59 p.m.
[5 points] Document filtering is an application that stores a large number of queries or user profiles and compares these profiles to every incoming document on a feed. Documents that are sufficiently similar to a profile are forwarded to the corresponding person via email or some other mechanism.
Explain the major differences compared to a search engine. Consider issues such as specific efficiency problems and the usefulness of ranking in a filtering application.
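To make the filtering setup above concrete, here is a minimal sketch of the compare-and-forward loop. It assumes bag-of-words vectors, cosine similarity, and a fixed threshold of 0.3; none of these choices are specified in the assignment, and the names `vectorize`, `cosine`, and `route` are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(document, profiles, threshold=0.3):
    """Return the users whose stored profile is similar enough to the document.

    The threshold value is an illustrative assumption, not part of the assignment.
    """
    doc_vec = vectorize(document)
    return [
        user
        for user, profile in profiles.items()
        if cosine(doc_vec, vectorize(profile)) >= threshold
    ]


profiles = {
    "alice": "web crawler search engine",
    "bob": "baseball scores",
}
recipients = route("a new search engine crawler for the web", profiles)
print(recipients)
```

Note that this sketch scans every stored profile for each incoming document and makes a yes/no decision per profile rather than producing a ranked list, which are the two points the question asks you to think about.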
Implement your own web crawler, with the following properties:
Start from the seed page http://en.wikipedia.org/wiki/Gerard_Salton, the Wikipedia article on Gerard Salton, an important early researcher in information retrieval.
Only follow links with the prefix http://en.wikipedia.org/wiki/. In other words, do not follow links to non-English articles or to non-Wikipedia pages.
Do not follow links to the main page, http://en.wikipedia.org/wiki/Main_Page.
Hand in your code and instructions on how to (compile and) run it. In addition, hand in two lists of URLs:
What proportion of the total pages were retrieved by the focused crawler for "information retrieval"? Keep in mind that this will be a significant overestimate of the prevalence of Wikipedia articles on information retrieval.
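The link restrictions for the crawler can be sketched as a single filter function. This is only an illustration, not a reference solution: the helper name `should_follow` is hypothetical, treating the Main_Page URL as a page to skip is an assumption, and the sketch does not fetch pages, extract links, or handle the keyphrase for the focused crawl.

```python
from urllib.parse import urljoin, urldefrag

# Prefix and main-page URL as given in the assignment text.
PREFIX = "http://en.wikipedia.org/wiki/"
MAIN_PAGE = "http://en.wikipedia.org/wiki/Main_Page"


def should_follow(base_url, href):
    """Return the absolute URL if the link may be followed, else None.

    base_url is the page the link was found on; href is the raw link target.
    """
    # Resolve relative links against the current page and drop any #fragment.
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    # Stay within English Wikipedia articles.
    if not absolute.startswith(PREFIX):
        return None
    # Assumption: the main page is excluded from the crawl.
    if absolute == MAIN_PAGE:
        return None
    return absolute


seed = "http://en.wikipedia.org/wiki/Gerard_Salton"
for link in ["/wiki/Information_retrieval",
             "//fr.wikipedia.org/wiki/Gerard_Salton",
             "/wiki/Main_Page",
             "http://www.example.com/"]:
    print(link, "->", should_follow(seed, link))
```

A crawler would typically combine a filter like this with a set of already-visited URLs, so that each page is requested at most once.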