CS6200 Homework One

Assigned:
Wednesday, January 15, 2014
Due:
10:00pm, Tuesday, January 21, 2014

Important: Submit by committing your files to svn. See the checklist below to make sure you turn everything in.

Problem One

Imagine you work for a social network. Your users will post text entries on the network and will read entries that other users have written. Your job is to design the component that suggests entries for a user to read whenever the user visits their main page on the network. To accomplish this, design a system that treats the user's entries as queries and finds the entries that are most relevant to the user's own entries. For your assignment, you will need to do the following.

  • Create a block diagram in the style used in the second lecture of the major components of your system and their relationships.
  • Describe your algorithm using a flow chart, pseudocode, or similar approach. Provide enough detail that a good engineer with no IR background would understand it.
  • Write an explanation of your approach in no more than one page. You will be graded in part on the neatness of your submission and the level of professionalism in your writing. Your explanation should address (at least) the following issues:
    • How is this system different from a search engine?
    • Is ranking important for your system? Why or why not?
    • Which parts of your system are vulnerable to scalability issues, and how do you address that?
    • Do you need to worry about spam? If so, what can you do about it? If not, why not?
    • How does your system ensure a good user experience?
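To make the "entries as queries" idea concrete, here is one minimal sketch of the core matching step, using a simple TF-IDF weighting with cosine similarity. This is only an illustration of one possible approach, not a required or recommended design; the entry texts and tokenization below are placeholder examples.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build simple TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}   # rarer terms weigh more
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# The user's own entry acts as the "query"; other users' entries are ranked.
user_entry = "i love hiking in the mountains".split()
other_entries = [
    "best mountain hiking trails this weekend".split(),
    "my favorite pasta recipes".split(),
]
vecs = tf_idf_vectors([user_entry] + other_entries)
query = vecs[0]
scores = [(cosine(query, v), i) for i, v in enumerate(vecs[1:])]
ranked = sorted(scores, reverse=True)  # highest-similarity entries first
```

A real system would of course aggregate over many user entries, filter entries the user has already read, and cache results, but the ranking core can be this small.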

Problem Two

Write a web crawling program in the language of your choice. Your program must meet the criteria described below. Also, write a report of up to half a page describing your crawling approach. What are your crawler's strengths and vulnerabilities? Would you use this program in a production setting? What problems would it face, and how could it be improved? While you don't need to go "above and beyond" the criteria below, you are encouraged to think about what that might mean.

Program Specification
  • Start at the URL http://www.ccs.neu.edu and visit only web pages in the neu.edu and northeastern.edu domains.
  • Stop once you have visited 100 HTML pages.
  • Detect whether a page you visit is an HTML document, a PDF, or something else using the appropriate Content-Type HTTP header.
  • Detect all outgoing links on each HTML page you visit, and follow only links to pages you have not yet visited. This will require you to convert some links to a canonical form, e.g. http://somepage.com/my_page.html# should be converted to http://somepage.com/my_page.html.
  • For every web page you visit, write one line to a plain text output file: the page's canonical URL followed by the canonical URL of each outgoing link to an HTML or PDF page, all separated by single spaces. Here is a sample file in the correct format. If your output is correct, the graph of the pages you've visited should render correctly here.
  • Put some thought into the pages you decide to visit, and the order in which you visit them. Is there any way links on web pages could cause your crawler to misbehave? Remember that a crawler on the open web will face many novice and malicious web developers.
  • Your crawler must visit at most one page per five seconds. Under no circumstances, even during development and testing, should your crawler visit pages more frequently. This is necessary in order to be respectful of the limited resources of the web servers you are visiting. (Feel free during development to reduce the total number of pages you visit if this is slowing you down.)
  • Your crawler must respect robots.txt.
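As a starting point for the canonicalization requirement above, here is one possible sketch using only the standard library. It handles the fragment-stripping example from the specification plus a couple of other common normalizations; it is not exhaustive, and you should decide which additional cases (trailing slashes, query ordering, redirects, etc.) your crawler needs to handle.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Reduce a URL to a canonical form so duplicate pages are detected:
    drop the fragment, lowercase the scheme and host, remove default
    ports, and treat an empty path as '/'."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Strip the default port for the scheme, if present.
    if (scheme == "http" and netloc.endswith(":80")) or \
       (scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]
    if not path:
        path = "/"
    # Rebuild the URL with the fragment removed.
    return urlunsplit((scheme, netloc, path, query, ""))
```

For example, `canonicalize("http://somepage.com/my_page.html#")` yields `http://somepage.com/my_page.html`, so both forms map to the same visited-set entry. The standard library's `urllib.robotparser` module can similarly serve as a starting point for the robots.txt requirement.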
Files and Directories
  • Your source code should be in subversion in the folder hw1/java or hw1/python or, more generally, hw1/<language>. You may create whatever files or folders beneath that level you wish.
  • Your program can be named anything you like, and take any arguments you want. However, you must provide a shell script at hw1/run.sh which compiles your program, if necessary, and then runs it. A sample script for a python program is:
    #!/bin/bash
    python crawl.py --root=http://www.ccs.neu.edu
  • Do not commit any binaries for your source code. Your program must be compiled by run.sh.
Use of External Libraries
  • You may use existing libraries to download web pages and to parse HTML. You must write your own code to manage the list of visited URLs and to write your site graph.
  • Java: Add any external dependencies as .jar files to hw1/java/jar and make sure your classpath is set appropriately in run.sh.
  • Python: You must use pip-installable libraries. Create a second shell script at hw1/pip.sh, which runs pip install for your dependencies. Do not assume that any external library will already be installed.
  • Other languages: Adapt the above policies for your language in some reasonable way. If you're not sure what to do, feel free to ask.
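For example, if your Python crawler happened to depend on the requests and beautifulsoup4 packages (example names only; substitute whatever your program actually uses), hw1/pip.sh might look like:

```shell
#!/bin/bash
# Install the crawler's external dependencies.
# The package names below are illustrative; list your actual dependencies.
pip install requests beautifulsoup4
```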

Submission Checklist

Rubric