Kruskal's Algorithm and Disjoint Sets

Kruskal's Algorithm

Prim's Algorithm constructs a minimal spanning tree by growing a single tree. Kruskal's Algorithm constructs a minimal spanning tree by merging multiple trees. Recall that a tree is a connected acyclic graph.

 

The algorithm begins by sorting the edges by their weights. Beginning with an empty subgraph, the algorithm scans the sorted list of edges, adding the next edge to the subgraph if it does not create a cycle.

 

 

Algorithm Kruskal(G)

sort E by the edge weights

ET ← {} // empty set of tree edges

ecounter ← 0; k ← 0

while ecounter < |V|-1 do

k++

if ET union {ek} is acyclic then

ET ← ET union {ek}

ecounter++

return ET

 

 

(Illustrate the algorithm on an example graph.)

 

Another interpretation of Kruskal's algorithm: initially make |V| single-node trees (or sets). Each iteration takes the smallest remaining edge (u, v) from the list, finds the two trees (or sets) containing u and v, and checks that they are not the same. If these tests pass, the two trees (or sets) are connected (or merged).

 

Is it possible to connect two trees that do not share vertices with a single edge and make a cycle? No: a cycle would require a second path between the edge's endpoints, which would mean they were already in the same tree. Draw a picture.

 

The cost depends on finding and merging the trees (or sets). A data structure for finding and merging sets is called Disjoint Sets.

 

Disjoint Sets

Disjoint Sets is a data structure which partitions a set of items. A partition is a set of sets such that each item is in one and only one set. It has operations:

 

makeset(x) - makes a set from a single item

find(x) - finds the set that x belongs to

union(x, y) - makes the union of the sets containing x and y

 

We can assume that the items are represented by integers, which can be the index into an array.
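To pin down the semantics before looking at efficient implementations, here is a deliberately naive Python sketch of the three operations. The dictionary-of-sets representation is for illustration only; efficient representations are discussed next.

```python
# Naive reference implementation of disjoint sets, just to fix the
# semantics of makeset/find/union. Items are integers 0..n-1.
sets = {}          # maps set name -> set of members

def makeset(x):
    sets[x] = {x}                  # a new singleton set, named after x

def find(x):
    for name, members in sets.items():
        if x in members:           # linear scan: slow, but clearly correct
            return name

def union(x, y):
    a, b = find(x), find(y)
    if a != b:                     # merge b's members into a, drop set b
        sets[a] |= sets.pop(b)

for i in range(4):
    makeset(i)
union(0, 1)
union(2, 3)
# now find(0) == find(1) and find(2) == find(3), but find(0) != find(2)
```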

 

There are two popular implementations for disjoint sets, using linked lists or using trees.

Quick Find - Linked List

Uses linked lists to represent the sets, and an array, called the representative array, which is indexed by the item number; the value gives the set name (the smallest integer member in the set).

 

The operation makeset is obvious: update the representative array and make the single-element linked list. The cost is Θ(1).

 

The operation find is also obvious: just access the representative array. The cost is Θ(1).

 

The operation union is more expensive. Joining the two linked lists is easy enough, but the representative array must be updated for every member of one of the sets. This cost is linear in the set size.

 

For a sequence of n random unions the cost is Θ(n²). We can do better: if the set name is the name of the larger set, then the algorithm only needs to update the representative array for the members of the smaller set. This is called union by size. Then the cost of n unions is O(n lg n), because an item's representative is updated only when its set at least doubles in size, which can happen at most lg n times.

 

The cost of n-1 unions and m finds is O(n lg n + m).
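As a sketch, the quick-find scheme with union by size might look like this in Python. The class name is illustrative; `rep` plays the role of the representative array, and ordinary Python lists stand in for the linked lists of members.

```python
class QuickFind:
    """Quick-find disjoint sets: Θ(1) find, union by size."""

    def __init__(self, n):
        self.rep = list(range(n))               # rep[x] = name of x's set
        self.members = [[x] for x in range(n)]  # members of each named set

    def find(self, x):
        return self.rep[x]          # Θ(1): one array access

    def union(self, x, y):
        a, b = self.rep[x], self.rep[y]
        if a == b:
            return                  # already in the same set
        # union by size: relabel only the smaller set's members,
        # so each item is relabeled at most lg n times overall
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a
        for item in self.members[b]:
            self.rep[item] = a      # update the representative array
        self.members[a].extend(self.members[b])
        self.members[b] = []
```

Note that here the set name is whichever set was larger, rather than the smallest member; the asymptotic costs are the same.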

Quick Merge - Tree

This implementation uses trees of the items to represent the sets. The integer in the root of the tree is the set name. The links of the tree point from the children to the parent. Note this is not a binary tree and the links point in the opposite direction of most trees.

 

The operation makeset is obvious: just make a single-node tree. The cost is Θ(1).

 

The operation union links the root of one tree to the root of the other tree. The cost is Θ(1).

 

The operation find requires traversing up the tree and costs Θ(h), where h is the height of the tree. The height could be on the order of the set size. To control the cost, union should make the smaller tree a subtree of the larger tree. This is union by size (by set size) or union by rank (by tree height). Naturally this requires storing the tree size or rank in the root. Using union by size or rank, the height of a tree is logarithmic in the set size, because a node's depth increases only when its tree is merged into a tree at least as large.

 

The cost for n-1 unions and m finds is O(n + m lg n).

 

We can do even better by using path compression. Path compression links every node encountered during a find directly to the root. Then a sequence of n-1 unions and m finds costs O((n + m) α(n)), where α is the inverse Ackermann function, which is at most 4 for any practical input; the total is only slightly more than linear in n and m.
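A sketch of the tree implementation with union by rank and path compression (the class name is illustrative; `parent` holds the child-to-parent links described above):

```python
class DisjointSet:
    """Tree-based disjoint sets: union by rank plus path compression."""

    def __init__(self, n):
        self.parent = list(range(n))   # parent[x] == x means x is a root
        self.rank = [0] * n            # upper bound on each tree's height

    def find(self, x):
        root = x
        while self.parent[root] != root:   # traverse up to the root
            root = self.parent[root]
        while self.parent[x] != root:      # path compression: link every
            self.parent[x], x = root, self.parent[x]  # visited node to root
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return                     # same set: nothing to do
        # union by rank: the shorter tree becomes a subtree of the taller
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1         # heights were equal, so it grew by one
```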

 

Cost of Kruskal's algorithm

Below is another version of Kruskal's algorithm that makes the disjoint sets explicit.

 

 

Algorithm Kruskal(G)

sort E by the edge weights // note: sorted E serves as a priority queue

make disjoint sets of the vertices: makeset(v) for each v in V

T ← {}; ecounter ← 0

while ecounter < |V|-1 and E is not empty do

(u, v) ← remove minimum from E

if find(u) ≠ find(v) then

T ← T union {(u, v)}

union(u, v)

ecounter++

return T

 

 

 

What is the priority queue? E sorted.

What is the maximum number of finds? Two per edge, so 2|E| finds, which is O(m) where m = |E|.

What is the maximum number of unions? At most |V|-1 unions, because each union adds one edge to the minimum spanning tree, and a spanning tree has |V|-1 edges.

So there are O(m) finds and at most n-1 unions, where n = |V|.

 

The total cost is the cost of making the priority queue of edges (sorting E) and the disjoint set finds and unions.

If the implementation of disjoint sets is trees with path compression, the cost of the disjoint set finds and unions is essentially O(n + m).

Note that for a connected graph n ∈ O(m), so the disjoint sets operations are bounded by O(m).

 

Then the total cost of Kruskal's algorithm is dominated by sorting the edges: O(m lg m) for a connected graph.
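Putting the pieces together, here is a minimal runnable sketch of Kruskal's algorithm in Python, using a binary heap of edges as the priority queue and an inline disjoint set with path halving (a simple form of path compression). The function and variable names are illustrative, not from the notes.

```python
import heapq

def kruskal(n, edges):
    """Minimum spanning tree of a graph with vertices 0..n-1.
    edges: list of (weight, u, v) tuples. Returns a list of (u, v, weight).
    """
    parent = list(range(n))        # disjoint sets: parent[x] == x at a root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving during the climb
            x = parent[x]
        return x

    heapq.heapify(edges)           # the priority queue, ordered by weight
    tree = []
    while len(tree) < n - 1 and edges:
        w, u, v = heapq.heappop(edges)      # smallest remaining edge
        ru, rv = find(u), find(v)
        if ru != rv:               # different trees, so no cycle is created
            parent[ru] = rv        # union the two trees
            tree.append((u, v, w))
    return tree
```

For example, on a 4-vertex graph with edges of weights 1..5, the algorithm keeps the three cheapest edges that do not form a cycle, for a total weight of 6:

```python
mst = kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)])
# mst has 3 edges with total weight 1 + 2 + 3 = 6
```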