Union Find (from Chapter 5.1.4)

Notes by Gene Cooperman, © 2009 (may be freely copied as long as this copyright notice remains)

(For the 2009 year, this topic is not core material. If it is on any exam, the background of the algorithm will first be reviewed. Regardless, it is a beautiful and elegant topic for those who enjoy algorithms.)

Statement of Problem and Where It's Used (see text for now). (Also note that there is a Wikipedia article on Union-Find.)

Union-Find Algorithm

If Union-Find is done naively, it will have time O(n^2). Two key heuristics make it faster:

  1. When connecting two components, add a pointer from the representative of the smaller component to the representative of the larger component.
  2. Path Compression: If you have to follow several pointers to find the representative of your component, then do it twice! The second time, for each vertex that you visit, change the pointer to point directly to the component representative. Now, if you are asked a third time for the representative of any vertex on this path, you will do it in O(1) steps.

The second principle is called path compression. If you have time to implement only one heuristic, implement path compression. That will make your algorithm O(n log n). In addition, for many common cases, the actual number of steps will be closer to c n than to c n log n, for some constant c.
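The "do it twice" description of path compression can be sketched as follows. This is a minimal illustration, not code from the text: it assumes vertices are numbered 0..n-1 and that a list named parent (a hypothetical name) stores each vertex's pointer, with a representative pointing to itself.

```python
def find(parent, v):
    """Return the representative of v, compressing the path along the way."""
    # First pass: follow pointers until we reach the representative.
    root = v
    while parent[root] != root:
        root = parent[root]
    # Second pass: walk the same path again and point every vertex
    # on it directly at the representative.
    while parent[v] != root:
        next_v = parent[v]
        parent[v] = root
        v = next_v
    return root


# A chain 4 -> 3 -> 2 -> 1 -> 0, where 0 is the representative.
parent = [0, 0, 1, 2, 3]
find(parent, 4)
# Every vertex on the path now points directly at 0, so a later
# find on any of them takes a single step.
```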

The first principle is still too vague to have a name. When we say the "larger" component, we could mean based on the number of vertices in that component. But that would be inefficient to compute. (Because we have some indirect pointers that eventually lead to the component representative, it's difficult to update in the representative the total number of vertices in that component.)

There are two ways to get around the problem. One way is to do path compression immediately every time we add a new edge, which guarantees that the two vertices on either side of the edge point directly to the component representative. If we do that, then we will have an O(n log n) union-find algorithm.

The reason that the previous solution to the first principle is O(n log n) can be seen by starting with 8 vertices. First we add 4 edges to create 4 components of 2 elements each. Then we add 2 edges to create 2 components of 4 elements each, and do path compression to make all pointers point to the new representative. Then we add 1 edge to create 1 component of 8 elements, and do path compression on all indirect pointers. Generalizing this, we find that the extra path compressions force us to do O(n log n) total work.
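The 8-vertex example can be checked by counting pointer rewrites. The sketch below is only an illustration under stated assumptions: n is a power of two, components are merged pairwise level by level, and only the absorbed component's vertices are rewritten. The names rep, members, and eager_relabel_updates are hypothetical.

```python
def eager_relabel_updates(n):
    """Count pointer rewrites when n singletons are merged pairwise.

    Every vertex always points directly at its representative, so each
    merge rewrites the pointer of every vertex in the absorbed component.
    """
    rep = list(range(n))                   # rep[v] = representative of v
    members = {v: [v] for v in range(n)}   # representative -> its vertices
    updates = 0
    size = 1
    while size < n:
        # Merge neighbouring components of `size` vertices each.
        for a in range(0, n - size, 2 * size):
            ra, rb = rep[a], rep[a + size]
            for v in members[rb]:
                rep[v] = ra
                updates += 1
            members[ra].extend(members.pop(rb))
        size *= 2
    return updates


print(eager_relabel_updates(8))    # 12, i.e. (8/2) * log2(8)
print(eager_relabel_updates(16))   # 32, i.e. (16/2) * log2(16)
```

Each of the log2(n) levels rewrites n/2 pointers, which is where the n log n total comes from.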

So, people use a less accurate (but more efficient to compute) method for deciding which component is larger. The rank of a component is the longest path (following pointers) in that component. Each component representative stores the rank of its component. When we add a single vertex to a component, we immediately do path compression (or else we would have a hard time efficiently updating the rank of that component, which is stored within the component representative). When we combine two components A and B, we point the representative of the lower-rank component at the representative of the higher-rank one. The rank of the new component, rank(C), is then max(rank(A), rank(B)) if the two ranks differ, and rank(A) + 1 if they are equal. Test yourself by showing why.
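A minimal sketch of linking by rank, in the usual formulation where the stored rank grows by one only when two components of equal rank are combined. The list names parent and rank are assumptions for illustration, and find here deliberately does no path compression, so each stored rank equals the longest pointer path in its component.

```python
def find(parent, v):
    """Follow pointers to the representative (no compression here)."""
    while parent[v] != v:
        v = parent[v]
    return v


def union(parent, rank, a, b):
    """Combine the components of a and b; return the new representative."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return ra                  # already in the same component
    if rank[ra] < rank[rb]:
        ra, rb = rb, ra            # make ra the higher-rank representative
    parent[rb] = ra                # lower rank points at higher rank
    if rank[ra] == rank[rb]:
        rank[ra] += 1              # equal ranks: longest path grows by one
    return ra


n = 8
parent = list(range(n))
rank = [0] * n
for a, b in [(0, 1), (2, 3), (4, 5), (6, 7), (0, 2), (4, 6), (0, 4)]:
    union(parent, rank, a, b)
print(rank[find(parent, 7)])       # 3, i.e. log2(8)
```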

The rank heuristic is called Union by Rank. The combination of the two heuristics leads to an algorithm whose complexity is O(n Ack^-1(n)), where Ack^-1(n) is the inverse Ackermann function. This is one of the slowest growing functions known to mathematics. We (and the text) do not prove the complexity O(n Ack^-1(n)), but it can be found in books like Cormen, Leiserson, Rivest, and Stein, for those who are interested.

So, in summary, an almost linear algorithm for Union-Find exists. It works by combining the two heuristics:

  1. Union by Rank
  2. Path Compression
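Putting the two heuristics together might look like the sketch below. It is one possible arrangement, not the text's implementation: the class name UnionFind, its fields, and the components counter are all assumptions made for illustration.

```python
class UnionFind:
    """Union by rank plus path compression over vertices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))   # each vertex starts as its own representative
        self.rank = [0] * n            # upper bound on the longest pointer path
        self.components = n            # current number of components

    def find(self, v):
        # Pass 1: locate the representative.
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Pass 2: point everything on the path directly at it.
        while self.parent[v] != root:
            next_v = self.parent[v]
            self.parent[v] = root
            v = next_v
        return root

    def union(self, a, b):
        """Add the edge (a, b); return True if two components were merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        self.components -= 1
        return True


uf = UnionFind(6)
for a, b in [(0, 1), (1, 2), (3, 4)]:
    uf.union(a, b)
print(uf.components)               # 3: {0, 1, 2}, {3, 4}, {5}
print(uf.find(0) == uf.find(2))    # True
print(uf.find(0) == uf.find(3))    # False
```

Once path compression is in play, the stored rank is only an upper bound on the longest path, since compression shortens paths without lowering any rank.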

Note: Path compression is a simple form of dynamic programming or memoizing. We will discuss that in the next set of notes.