important Submit by committing your files to svn. See the checklist below to make sure you turn everything in.
Imagine you work for a social network. Your users will post text entries on the network and will read entries that other users have written. Your job is to design the component that suggests entries for a user to read whenever the user visits their main page on the network. To accomplish this, design a system which treats the user's entries as queries and finds the entries that are most relevant to the user's own entries. For your assignment, you will need to do the following.
Write a web crawling program in the language of your choice. Your program must meet the criteria described below. Also, write a report of up to half a page describing your crawling approach. What are your crawler's strengths and vulnerabilities? Would you use this program in a production setting? What problems would it face, and how could it be improved? While you don't need to go "above and beyond" the criteria below, you are encouraged to think about what that might mean.
Content-Type
HTTP header.http://somepage.com/my_page.html#
should be converted to http://somepage.com/my_page.html
.hw1/java
or hw1/python
or, more generally, hw1/<language>
. You may create whatever files or folders beneath that level you wish.hw1/run.sh
which compiles your program, if necessary, and then runs it. A sample script for a python program is:
#!/bin/bash
python crawl.py --root=http://www.ccs.neu.edu
hw1/java/jar
and make sure your classpath is set appropriately in run.shhw1/pip.sh
, which runs pip install for your dependencies. Do not assume that any external library will already be installed.