Notes by Gene Cooperman, © 2009 (may be freely copied as long as this copyright notice remains)
The textbook is not as clear as one would like about its linear-time algorithm for finding strongly connected components. This expands on it. Recall that the text's algorithm is:
Step 1: Run depth-first search on the reverse graph, GR, and record the post number of each vertex.
Step 2: Run the undirected connected components algorithm on G, processing the vertices in decreasing order of their post numbers from Step 1.
In the algorithm, the wording "undirected connected components algorithm" is confusing. They mean to take the algorithm originally designed for undirected graphs, and now to run it on the current problem, which is a directed graph.
Note also that all of the depth-first search algorithms say to choose an initial vertex, explore depth-first, and eventually, after exploring the connected component, return to the initial vertex. If unvisited vertices remain, choose a new initial vertex from among the unvisited vertices and continue the depth-first search. Stop only when every vertex has been visited.
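As a minimal sketch of this restart convention (my own code, not the textbook's; it assumes the graph is given as an adjacency-list dictionary):

```python
def dfs_post_order(graph):
    """Depth-first search that restarts from a new initial vertex
    whenever unvisited vertices remain; returns vertices in the
    order their post() numbers are assigned (1, 2, 3, ...)."""
    visited = set()
    post_order = []           # post_order[k] has post number k + 1

    def explore(u):
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                explore(v)
        post_order.append(u)  # post() happens as we back out of u

    for u in graph:           # restart until every vertex is visited
        if u not in visited:
            explore(u)
    return post_order

# Hypothetical 4-vertex example with two separate components.
g = {'a': ['b'], 'b': [], 'c': ['d'], 'd': []}
print(dfs_post_order(g))      # prints ['b', 'a', 'd', 'c']
```

The first restart (from 'a') finishes b then a; the second restart (from 'c') finishes d then c, so post numbers run b, a, d, c from lowest to highest.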
This section explains the intuition of why the algorithm in the textbook works. The book has shown that a directed graph can be viewed as a DAG of strongly connected components. Our task is to list the strongly connected components.
Since each strongly connected component is a node in a larger DAG, we can speak of sink strongly connected components and source strongly connected components.
A sink strongly connected component is particularly easy to find. In a depth-first search, if we ever enter a sink strongly connected component, we will never leave it until we have explored every vertex in the component. (This is because the sink component is exactly the reachable set of the first vertex that we discover within the sink component.)
We also note that the first sink strongly connected component that we visit must have contiguous post numbers, since we must eventually visit every vertex of the strongly connected component before we can back up out of the sink strongly connected component. We will know it is a sink, since no vertex of this strongly connected component has an edge leading out of the strongly connected component. (The post numbers need not be the smallest, since we might have previously visited a cycle, and then before finishing backing out of the cycle, we might have backed into a vertex with two outgoing edges. The second outgoing edge might have led into the current sink strongly connected component.)
When we return to the last vertex on the stack outside of the first sink strongly connected component (where we had originally first entered the sink), we can imagine cutting the sink strongly connected component out of our graph, since every vertex there has already been visited. In the remaining graph, we have not yet discovered any sink strongly connected components, and we are in the middle of a depth-first search. So, we can continue our depth-first search and discover a second sink strongly connected component. (It is even possible that the second sink strongly connected component has already been fully visited, but in that case, some of its vertices are still on the stack. So, we will identify the second sink strongly connected component when we back out of the last vertex of that component.)
We continue in this way to find all strongly connected components. There is a way to identify each strongly connected component in linear time at the moment that the last vertex of the component is popped from the stack.
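The textbook does not give that pop-based identification scheme, but it is essentially Tarjan's low-link algorithm. A sketch (my own code and naming, again assuming an adjacency-list dictionary) that emits each strongly connected component at the moment its last vertex is popped:

```python
def tarjan_scc(graph):
    """Emit each strongly connected component at the moment
    its last vertex is popped from the stack."""
    index = {}                # discovery number of each vertex
    low = {}                  # lowest discovery number reachable
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def strongconnect(u):
        index[u] = low[u] = counter[0]
        counter[0] += 1
        stack.append(u)
        on_stack.add(u)
        for v in graph.get(u, []):
            if v not in index:
                strongconnect(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:   # u is the last vertex of its component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == u:
                    break
            sccs.append(comp)

    for u in graph:              # usual restart rule
        if u not in index:
            strongconnect(u)
    return sccs

# Hypothetical example: a and b form a cycle, c is a sink by itself.
print(tarjan_scc({'a': ['b'], 'b': ['a', 'c'], 'c': []}))
# prints [['c'], ['b', 'a']]
```

A component is emitted exactly when its first-discovered vertex finishes, which is the moment the component's last vertex leaves the stack, and the whole run is linear time.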
The textbook prefers an algorithm that doesn't need extra pseudo-code to identify strongly connected components as we leave them. To do this, the textbook prefers to use post numbers and to analyze the reverse graph, GR, instead of the ordinary graph, G. But that's okay. The reverse graph, GR, has the same strongly connected components as the graph G. (A cycle is a cycle, even if we reverse the cycle.)
We observe that the vertex with the highest post number in a depth-first search always belongs to a source strongly connected component. Since Step 1 runs the depth-first search on GR, that vertex is part of a source for GR, and so it must be part of a sink strongly connected component for G. So, we know how to discover a first vertex in a sink strongly connected component. (Step 1 will implicitly do it for us.) Once we are in a sink, the reachable set of that first vertex (computed by depth-first search) will be exactly the sink strongly connected component. So, we can discover the entire sink strongly connected component of G, cut it out, and then continue by finding a sink strongly connected component of the remaining graph. This is the larger intuition. Now, on to the details:
In Step 1, we compute the post numbers for the reverse graph. In the reverse graph, the vertex with the highest post number will belong to a source strongly connected component.
Back in the original graph, we know that this vertex therefore belongs to a sink strongly connected component. Since it is a sink back in the original graph, during Step 2 we are forced to explore all of the sink strongly connected component. We will then hit a dead end, since a sink has no exit.
So, we will be forced to choose a new and different initial vertex outside of this strongly connected component as part of Step 2. But due to Step 1, this new initial vertex will be in a sink strongly connected component of the remaining graph, once the first sink strongly connected component has been cut out. We continue in this way.
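Putting Steps 1 and 2 together, here is a sketch of the whole algorithm under the same adjacency-list assumption (the helper names are mine; the depth-first search is written iteratively, but a recursive version works equally well):

```python
def kosaraju_scc(graph):
    """Step 1: depth-first search on the reverse graph GR, recording
    post order.  Step 2: depth-first search on G itself, restarting
    from the unvisited vertex with the highest remaining post number;
    each restart peels off one sink strongly connected component."""
    # Build the reverse graph GR: every edge u -> v becomes v -> u.
    gr = {u: [] for u in graph}
    for u, nbrs in graph.items():
        for v in nbrs:
            gr.setdefault(v, []).append(u)

    def dfs(g, order, visited, collect):
        for s in order:
            if s in visited:
                continue
            visited.add(s)
            comp = []
            stack = [(s, iter(g.get(s, [])))]
            while stack:
                u, it = stack[-1]
                v = next((w for w in it if w not in visited), None)
                if v is None:            # u is finished: this is post(u)
                    stack.pop()
                    comp.append(u)
                else:
                    visited.add(v)
                    stack.append((v, iter(g.get(v, []))))
            collect.append(comp)         # one tree of the DFS forest
        return collect

    # Step 1: concatenating the trees' post orders gives the global
    # post order on GR, lowest post number first.
    post = [u for tree in dfs(gr, list(gr), set(), []) for u in tree]
    # Step 2: explore G in decreasing order of post number.
    return dfs(graph, reversed(post), set(), [])
```

Each list returned by Step 2 is one strongly connected component, and they come out sinks first, i.e., in reverse topological order of the DAG of components.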
Now, let's redo the example using the graph of Figure 3.9, just as the book does at the end of the chapter.
Since we reverse the graph, we will use the largest letter as our first initial vertex. So, we start at L. We can only go backwards along edges in this phase. So, after exploring "LKHGIJ", we cannot go further.
stack: LKHGIJ (but not done backtracking)
We must back up from J. Back at G, we find a second outgoing edge. This leads to E. So, we follow E B A, reach a dead end and backtrack.
Since we have backed up through A, B, and E (in that order), these three vertices receive the next post numbers:
post numbers: J I || A B E
stack: L → K → H → G
We are back at G. We back up to H. From H, we can explore F and then C. But we will have to back up to H again.
post numbers: J I || A B E || G || C F
stack: L → K → H
We are back at H. We can finish backing up to L. From L, there is no further path.
post numbers: J I || A B E || G || C F || H K L
D is the only unvisited vertex. So, we choose it as a second initial vertex. D receives the highest post number.
post numbers: (lowest) J I || A B E || G || C F || H K L || D (highest)
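Step 1 can be checked mechanically. The edge lists below are reconstructed from the walkthrough above, with each vertex's neighbors listed in the order the walkthrough visits them; they may not match the adjacency order printed in Figure 3.9 itself.

```python
# GR, the reverse of the Figure 3.9 graph, as reconstructed from the
# walkthrough above (neighbor order chosen to match the visit order there).
gr = {
    'L': ['K'], 'K': ['H'], 'H': ['G', 'F'], 'G': ['I', 'E'],
    'I': ['J'], 'J': ['G', 'L'], 'E': ['B'], 'B': ['A', 'E'],
    'A': [],    'F': ['C', 'E'], 'C': ['B', 'F'], 'D': ['B'],
}

def post_order(graph, initial_order):
    visited, order = set(), []
    def explore(u):
        visited.add(u)
        for v in graph[u]:
            if v not in visited:
                explore(v)
        order.append(u)          # post(u) is assigned as we back out
    for u in initial_order:      # restart rule: try largest letter first
        if u not in visited:
            explore(u)
    return order

print(post_order(gr, sorted(gr, reverse=True)))
# prints ['J', 'I', 'A', 'B', 'E', 'G', 'C', 'F', 'H', 'K', 'L', 'D'],
# the post numbers above, from lowest to highest
```

The first restart is at L and the only other restart is at D, reproducing the two initial vertices of the walkthrough.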
In Step 2, we return to the original graph G. Start at A and use the post numbers of Step 1 to break ties (highest post number wins). So we travel from A to B. From B, we could choose to visit C, D, or E. D has the highest post number, so we visit D. From D, there is nowhere to go; we execute post(), discover that D is the only vertex of its component on the stack, and so create D as a strongly connected component. We then pop D from the stack, add D to the list of strongly connected components already found, and return to B.
current stack: A → B
Back at B, we choose C next, instead of E. (C has a higher post number.) From C, we visit F, and from F we visit H, K, L, J, I, and G. From G, an edge leads back to H (and H is still on the stack), and so we know that H K L J I G is a strongly connected component. We now have:
D || G I J L K H
current stack: A → B → C → F
From F (after having tried H), we can next go forward to C. C is already on the stack, and so we have another strongly connected component, F C. We add it to the list and back up further.
D || G I J L K H || F C
current stack: A → B
From B (having tried D and C) we must now visit E. From E (having visited F), we are led back to B, which is still on the stack. So, EB is a strongly connected component.
D || G I J L K H || F C || E B
current stack: A
It remains to return to A. A itself is still on the stack, and so A by itself is a strongly connected component. We now have:
D || G I J L K H || F C || E B || A
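The Step 2 result can also be checked mechanically. The sketch below uses the textbook's restart rule (each new initial vertex is the unvisited vertex with the highest remaining post number) on the Figure 3.9 graph as reconstructed from the walkthrough above; note that this rule also produces D first, since D has the highest post number.

```python
# G itself (Figure 3.9), reconstructed from the walkthrough above.
g = {
    'A': ['B'], 'B': ['C', 'D', 'E'], 'C': ['F'], 'D': [],
    'E': ['B', 'F', 'G'], 'F': ['C', 'H'], 'G': ['H', 'J'],
    'H': ['K'], 'I': ['G'], 'J': ['I'], 'K': ['L'], 'L': ['J'],
}
# Post numbers from Step 1, lowest to highest.
post = ['J', 'I', 'A', 'B', 'E', 'G', 'C', 'F', 'H', 'K', 'L', 'D']

visited, sccs = set(), []

def explore(u, comp):
    visited.add(u)
    comp.append(u)
    for v in g[u]:
        if v not in visited:
            explore(v, comp)

# Step 2: each restart, taken in decreasing order of post number,
# peels one sink strongly connected component off the remaining graph.
for u in reversed(post):
    if u not in visited:
        comp = []
        explore(u, comp)
        sccs.append(comp)

print(sccs)
# prints [['D'], ['L', 'J', 'I', 'G', 'H', 'K'], ['F', 'C'], ['E', 'B'], ['A']]
```

The five components agree with the list above: D, then G H I J K L, then C F, then B E, then A.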
[ I know that the textbook places D after G I J L K H in the order of visits. I place D first. Does anyone see an error in my analysis? ]
In a highly simplified World Wide Web, one would like to represent the web as a directed graph, with hyperlinks being directed edges. To keep things simple, let's also assume that the web pages are divided into strongly connected components according to common interests. (This is not true in reality, but it gives us a starting point for thinking about the problem.)
In fact, in reality, Google uses a more realistic analysis, based on a statistical approach. There is a Wikipedia article on the Google Page Rank algorithm: http://en.wikipedia.org/wiki/Page_rank