CS6200 Homework One

Assigned:
Wednesday, January 15, 2014
Due:
10:00pm, Tuesday, January 21, 2014

Important: Submit by committing your files to svn. See the checklist below to make sure you turn everything in.

Problem One

Imagine you work for a social network. Your users will post text entries on the network and will read entries that other users have written. Your job is to design the component that suggests entries for a user to read whenever the user visits their main page on the network. To accomplish this, design a system that treats the user's entries as queries and finds the entries that are most relevant to the user's own entries. For your assignment, you will need to do the following.

  • Create a block diagram in the style used in the second lecture of the major components of your system and their relationships.
  • Describe your algorithm using a flow chart, pseudocode, or similar approach. Provide enough detail that a good engineer with no IR background would understand it.
  • Write an explanation of your approach in no more than one page. You will be graded in part on the neatness of your submission and the level of professionalism in your writing. Your explanation should address (at least) the following issues:
    • How is this system different from a search engine?
    • Is ranking important for your system? Why or why not?
    • Which parts of your system are vulnerable to scalability issues, and how do you address that?
    • Do you need to worry about spam? If so, what can you do about it? If not, why not?
    • How does your system ensure a good user experience?
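To make the "entries as queries" idea concrete, here is one minimal sketch of the core matching step, using a simple TF-IDF weighting with cosine similarity. This is only an illustration of one possible approach, not a required or recommended design; the entry texts and tokenization below are placeholder examples.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build simple TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}   # rarer terms weigh more
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# The user's own entry acts as the "query"; other users' entries are ranked.
user_entry = "i love hiking in the mountains".split()
other_entries = [
    "best mountain hiking trails this weekend".split(),
    "my favorite pasta recipes".split(),
]
vecs = tf_idf_vectors([user_entry] + other_entries)
query = vecs[0]
scores = [(cosine(query, v), i) for i, v in enumerate(vecs[1:])]
ranked = sorted(scores, reverse=True)  # highest-similarity entries first
```

A real system would of course aggregate over many user entries, filter entries the user has already read, and cache results, but the ranking core can be this small.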

Problem Two

Write a web crawling program in the language of your choice. Your program must meet the criteria described below. Also, write a report of up to half a page describing your crawling approach. What are your crawler's strengths and vulnerabilities? Would you use this program in a production setting? What problems would it face, and how could it be improved? While you don't need to go "above and beyond" the criteria below, you are encouraged to think about what that might mean.

Program Specification
  • Start at the URL http://www.ccs.neu.edu and visit only web pages in the neu.edu and northeastern.edu domains.
  • Stop once you have visited 100 HTML pages.
  • Detect whether a page you visit is an HTML document, a PDF, or something else using the appropriate Content-Type HTTP header.
  • Detect all outgoing links on each HTML page you visit, and follow only links to pages you have not yet visited. This will require you to convert some links to a canonical form, e.g. http://somepage.com/my_page.html# should be converted to http://somepage.com/my_page.html.
  • For every web page you visit, write one line to a plain text output file: the page's canonical URL followed by the canonical URL of each outgoing link to an HTML or PDF page, all separated by single spaces. Here is a sample file in the correct format. If your output is correct, the graph of the pages you've visited should render correctly here.
  • Put some thought into the pages you decide to visit, and the order in which you visit them. Is there any way links on web pages could cause your crawler to misbehave? Remember that a crawler on the open web will face many novice and malicious web developers.
  • Your crawler must visit at most one page per five seconds. Under no circumstances, even during development and testing, should your crawler visit pages more frequently. This is necessary in order to be respectful of the limited resources of the web servers you are visiting. (Feel free during development to reduce the total number of pages you visit if this is slowing you down.)
  • Your crawler must respect robots.txt.
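As a starting point for the canonicalization requirement above, here is one possible sketch using only the standard library. It handles the fragment-stripping example from the specification plus a couple of other common normalizations; it is not exhaustive, and you should decide which additional cases (trailing slashes, query ordering, redirects, etc.) your crawler needs to handle.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Reduce a URL to a canonical form so duplicate pages are detected:
    drop the fragment, lowercase the scheme and host, remove default
    ports, and treat an empty path as '/'."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Strip the default port for the scheme, if present.
    if (scheme == "http" and netloc.endswith(":80")) or \
       (scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]
    if not path:
        path = "/"
    # Rebuild the URL with the fragment removed.
    return urlunsplit((scheme, netloc, path, query, ""))
```

For example, `canonicalize("http://somepage.com/my_page.html#")` yields `http://somepage.com/my_page.html`, so both forms map to the same visited-set entry. The standard library's `urllib.robotparser` module can similarly serve as a starting point for the robots.txt requirement.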
Files and Directories
  • Your source code should be in subversion in the folder hw1/java or hw1/python or, more generally, hw1/<language>. You may create whatever files or folders beneath that level you wish.
  • Your program can be named anything you like, and take any arguments you want. However, you must provide a shell script at hw1/run.sh which compiles your program, if necessary, and then runs it. A sample script for a python program is:
    #!/bin/bash
    python crawl.py --root=http://www.ccs.neu.edu
  • Do not commit any binaries for your source code. Your program must be compiled by run.sh.
Use of External Libraries
  • You may use existing libraries to download web pages and to parse HTML. You must write your own code to manage the list of visited URLs and to write your site graph.
  • Java: Add any external dependencies as .jar files to hw1/java/jar and make sure your classpath is set appropriately in run.sh.
  • Python: You must use pip-installable libraries. Create a second shell script at hw1/pip.sh, which runs pip install for your dependencies. Do not assume that any external library will already be installed.
  • Other languages: Adapt the above policies for your language in some reasonable way. If you're not sure what to do, feel free to ask.
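For example, if your Python crawler happened to depend on the requests and beautifulsoup4 packages (example names only; substitute whatever your program actually uses), hw1/pip.sh might look like:

```shell
#!/bin/bash
# Install the crawler's external dependencies.
# The package names below are illustrative; list your actual dependencies.
pip install requests beautifulsoup4
```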

Submission Checklist

Rubric