IS4200/CS6200 12S Project 02

Created: Tue 13 Mar 2012
Last modified: 

Assigned: Tue 13 Mar 2012
Due: Tue 27 Mar 2012


PageRank

In this project, you will compute PageRank on a collection of 183,811 web documents. Consider the version of PageRank described in class and in HW03. PageRank can be computed iteratively as show in the following pseudocode:
// P is the set of all pages; |P| = N
// S is the set of sink nodes, i.e., pages that have no out links
// M(p) is the set of pages that link to page p
// L(q) is the number of out-links from page q
// d is the PageRank damping/teleportation factor; use d = 0.85 as is typical

foreach page p in P
  PR(p) = 1/N                          /* initial value */

while PageRank has not converged do
  sinkPR = 0
  foreach page p in S                  /* calculate total sink PR */
    sinkPR += PR(p)
  foreach page p in P
    newPR(p) = (1-d)/N                 /* teleportation */
    newPR(p) += d*sinkPR/N             /* spread remaining sink PR evenly */
    foreach page q in M(p)             /* pages pointing to p */
      newPR(p) += d*PR(q)/L(q)         /* add share of PageRank from in-links */
  foreach page p
    PR(p) = newPR(p)

return PR
In order to facilitate the computation of PageRank using the above pseudocode, one would ideally have access to an in-link respresentation of the web graph, i.e., for each page p, a list of the pages q that link to p. For example, given the graph shown in Problem 5 from HW03, one such representation would be as follows
A D E F
B A F
C A B D
D B C
E B C D F
F A B D
where the first line indicates that page A is linked from pages D, E, and F, and so on. Note that unlike the example in HW03, in a real web graph, not every page will have in-links, nor will every page have out-links.

The Project

What to Submit

In addition to the items mentioned above, you should hand in a copy of your code, which should hopefully be relatively short. (The pseudocode for the iterative PageRank algorithm is 10 lines long, as shown above.)

Grading

This project is worth a maximum of 200 points.