Strongly Connected Components (from Chapter 3.4.2)

The textbook's description of its linear-time algorithm for finding strongly connected components is not as clear as one would like. These notes expand on it. Recall that the textbook's algorithm is:

1. Run depth-first search on G^R.
2. Run the undirected connected components algorithm (from Section 3.2.3) on G, and during the depth-first search, process the vertices in decreasing order of their post numbers from step 1.

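A more elaborate description follows. It swaps the roles of G and G^R relative to the textbook, which is harmless because a graph and its reverse have the same strongly connected components: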

1. Starting at any vertex, execute dfs() and record the post numbers.
2. Now execute dfs() on the reverse graph, starting at the vertex with the highest post number, and calling explore() on vertices with higher post numbers first.

2.a. Execute dfs() on the reverse graph, starting at the vertex with the highest post number from step 1. As in dfs(), call explore() on that vertex.

2.a. NOTE: In general, this call from dfs() to explore() will not reach every vertex. Once its recursive calls finish, control returns to dfs(). The vertices visited during this call form a single strongly connected component.

2.b. Back in dfs(), among the unvisited vertices, choose the one with the highest post number from step 1. As in dfs(), call explore() on that vertex; the vertices it visits form the next strongly connected component.

2.c. If there are more unvisited vertices, go to step 2.b.
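
The steps above can be sketched in Python as follows. This is a minimal illustration, not the textbook's code: the dictionary-of-adjacency-lists representation, the helper reverse_graph(), and the function name strongly_connected_components() are assumptions made for the example, and every vertex is assumed to appear as a key of the dictionary. Instead of storing explicit post numbers, the sketch records vertices in the order they finish, which is the same as increasing post number.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # assumed representation: vertex -> list of successors


def reverse_graph(graph: Graph) -> Graph:
    """Return the graph with every edge u -> v flipped to v -> u."""
    reversed_graph: Graph = {u: [] for u in graph}
    for u, neighbors in graph.items():
        for v in neighbors:
            reversed_graph[v].append(u)
    return reversed_graph


def explore(graph: Graph, v: str, visited: set, finished: List[str]) -> None:
    """Visit everything reachable from v that is still unvisited."""
    visited.add(v)
    for w in graph[v]:
        if w not in visited:
            explore(graph, w, visited, finished)
    finished.append(v)  # vertices are appended in increasing post-number order


def strongly_connected_components(graph: Graph) -> List[List[str]]:
    # Step 1: dfs() on the graph, recording vertices by post number.
    visited: set = set()
    by_post: List[str] = []
    for v in graph:
        if v not in visited:
            explore(graph, v, visited, by_post)

    # Step 2: dfs() on the reverse graph, highest post number first.
    reversed_graph = reverse_graph(graph)
    visited = set()
    components: List[List[str]] = []
    for v in reversed(by_post):      # steps 2.a and 2.b: highest remaining post number
        if v not in visited:         # step 2.c: repeat while unvisited vertices remain
            component: List[str] = []
            explore(reversed_graph, v, visited, component)
            components.append(component)  # one call to explore() = one component
    return components
```

On the graph with edges A->B, B->A, B->C, C->D and D->C, this returns two components, {A, B} and {C, D}. Because explore() is recursive, a very deep graph may exceed Python's default recursion limit.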

As for why this works, the first two principles below are the key:

1. The node that receives the highest post number in a depth-first search must lie in a source strongly connected component. (Property 2 in the textbook.)
2. In a reverse graph, a source strongly connected component becomes a sink strongly connected component.
3. [NOT NEEDED FOR THIS ALGORITHM] The post numbers of a sink strongly connected component are contiguous.

The principles imply that Step 1 of the algorithm finds a vertex in a source strongly connected component, namely the vertex with the highest post number. (If every vertex is reachable from the initial vertex, the initial vertex itself receives the highest post number. Otherwise, dfs() has to restart from other vertices, and the root of its last restart receives the highest post number.) By the second principle, the vertex with the highest post number from Step 1 must therefore be part of a sink strongly connected component in the reverse graph.
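
A tiny example may help make this concrete. The sketch below is an illustration under assumptions: the four-vertex graph, the hand-written reverse graph, and the helpers post_numbers() and reachable() are all invented for this example. It starts dfs() at C, which lies in the sink component {C, D} of the original graph, and looks at which vertex ends up with the highest post number.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # assumed representation: vertex -> list of successors


def post_numbers(graph: Graph, order: List[str]) -> Dict[str, int]:
    """Run dfs(), trying start vertices in the given order; return post numbers."""
    clock = 1
    visited = set()
    post: Dict[str, int] = {}

    def explore(v: str) -> None:
        nonlocal clock
        visited.add(v)
        clock += 1               # this tick is v's pre number (not stored)
        for w in graph[v]:
            if w not in visited:
                explore(w)
        post[v] = clock
        clock += 1

    for v in order:
        if v not in visited:
            explore(v)
    return post


def reachable(graph: Graph, start: str) -> set:
    """Set of vertices reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


# Hypothetical example: {A, B} is a source component, {C, D} is a sink.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C"]}
reverse = {"A": ["B"], "B": ["A"], "C": ["B", "D"], "D": ["C"]}  # edges flipped by hand

post = post_numbers(graph, order=["C", "D", "A", "B"])  # dfs() starts at C
top = max(post, key=post.get)                            # vertex with highest post number

print(top, reachable(graph, top), reachable(reverse, top))
```

Here the highest post number goes to A, not to the initial vertex C. In the original graph, A reaches every vertex (its component {A, B} is a source), while in the reverse graph it reaches only A and B, so there the same component is a sink.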

So, in Step 2, we start in a sink strongly connected component. The routine explore() recursively visits all vertices of this component and, because no edge leaves a sink component, no others. It then returns to dfs().

If we remove the sink strongly connected component from the reverse graph, then the remaining unvisited vertex with the highest post number from Step 1 must be part of a sink strongly connected component of the remaining reverse graph. The removal is implicit: explore() never re-enters a vertex that has already been visited. So the next call to explore() visits exactly one more strongly connected component.

At the end of Chapter 3, the textbook provides an example using the directed graph in Figure 3.9. That example should now be clear, using the above explanation. When there is more than one outgoing edge, the book uses the edge leading toward the vertex with the highest post number as the "tie breaker", so that everybody will visit the nodes in the same order.

Page Rank (Google algorithm)

In a highly simplified World Wide Web, one would like to represent the web as a directed graph with hyperlinks being directed edges. To keep things simple, let's also assume that the web pages are divided into strongly connected components according to common interests. (This is not true in reality, but it gives one a starting point for thinking about the problem.)

In reality, Google uses a more realistic analysis based on a statistical approach. See the Wikipedia article on the [Google Page Rank Algorithm|http://en.wikipedia.org/wiki/Page_rank].