IS4200/CS6200 12S Project 02

Created: Tue 13 Mar 2012
Last modified:

Assigned: Tue 13 Mar 2012
Due: Tue 27 Mar 2012

PageRank

In this project, you will compute PageRank on a collection of 183,811 web documents. Consider the version of PageRank described in class and in HW03. PageRank can be computed iteratively as show in the following pseudocode:

// P is the set of all pages; |P| = N
// S is the set of sink nodes, i.e., pages that have no out links
// M(p) is the set of pages that link to page p
// L(q) is the number of out-links from page q
// d is the PageRank damping/teleportation factor; use d = 0.85 as is typical

foreach page p in P
  PR(p) = 1/N                          /* initial value */

while PageRank has not converged do
  sinkPR = 0
  foreach page p in S                  /* calculate total sink PR */
    sinkPR += PR(p)
  foreach page p in P
    newPR(p) = (1-d)/N                 /* teleportation */
    newPR(p) += d*sinkPR/N             /* spread remaining sink PR evenly */
    foreach page q in M(p)             /* pages pointing to p */
      newPR(p) += d*PR(q)/L(q)         /* add share of PageRank from in-links */
  foreach page p
    PR(p) = newPR(p)

return PR

In order to facilitate the computation of PageRank using the above pseudocode, one would ideally have access to an in-link respresentation of the web graph, i.e., for each page p, a list of the pages q that link to p. For example, given the graph shown in Problem 5 from HW03, one such representation would be as follows

A D E F
B A F
C A B D
D B C
E B C D F
F A B D

where the first line indicates that page A is linked from pages D, E, and F, and so on. Note that unlike the example in HW03, in a real web graph, not every page will have in-links, nor will every page have out-links.

The Project

Implement the iterative PageRank algorithm as described above. Test your code on the example from HW03 using the input representation given above. Be sure that your code handles pages that have no in-links or out-links properly. (You may with to test on a few such examples.)
To hand in: List the PageRank values you obtain for each vertex after 1, 10, and 100 iterations of the PageRank algorithm.
Download the in-links file for the WT2g collection, a 2GB crawl of a subset of the web. This in-links file is in the format described above for the HW03 example.
Run your iterative version of PageRank algorithm until your PageRank values "converge". To test for convergence, calculate the perplexity of the PageRank distribution, where perplexity is simply 2 raised to the entropy of the PageRank distribution, i.e., 2^H(PR). Perplexity is a measure of how "skewed" a distribution is --- the more "skewed" (i.e., less uniform) a distribution is, the lower its preplexity. Informally, you can think of perplexity as measuring the number of elements that have a "reasonably large" probability weight; technically, the perplexity of a distribution with entropy h is the number of elements n such that a uniform distribution over n elements would also have entropy h. (Hence, both distributions would be equally "unpredictable".)
Run your iterative PageRank algorthm, outputting the perplexity of your PageRank distibution until the perplexity value no longer changes in the units position for at least four iterations. (The units position is the position just to the left of the decimal point.)
For debugging purposes, here are the first five perplexity values that you should obtain (roughly):

183811, 79669.9, 86267.7, 72260.4, 75132.4
To hand in: List the perplexity values you obtain in each round until convergence as described above.
Sort the collection of web pages by the PageRank values you obtain.
To hand in: List the document IDs of the top 10 pages as sorted by PageRank, together with their PageRank values.
Examine these top 10 pages in Lemur by using the "e=docID" option with database "d=0", which is the index of the WT2g collection. For example, the link

http://fiji4.ccs.neu.edu/~zerg/lemurcgi_IRclass/lemur.cgi?d=0&e=WT04-B22-268
will bring up document WT04-B22-268, which is an article on the Comprehensive Test Ban Treaty.
To hand in: Explain why these documents have high PageRank values, i.e., why is it that these particular pages are linked to by (possibly) many other pages with (possibly) high PageRank values. Are all of these documents ones that users would likely want to see in response to an appropriate query? Which one are and which ones are not? For those that are not "interesting" documents, why do they have high PageRank values? In short, give an analysis of the PageRank results you obtain.

What to Submit

In addition to the items mentioned above, you should hand in a copy of your code, which should hopefully be relatively short. (The pseudocode for the iterative PageRank algorithm is 10 lines long, as shown above.)

Grading

This project is worth a maximum of 200 points.