When I started to learn data structures many years ago, `disjoint-set`

is the one that gave me a deep impression, **it’s so simple, but elegant and useful**.

In this post, we will learn how `disjoint-set`

works, then we implement a maze generator with it. I wish this will help the beginners to find fun of data structures.

## Disjoint Set (Union-Find) Data Structure

`disjoint-set`

is a data structure that tracks a set of elements partitioned into a number of disjoint (non-overlapping) subsets. According to the definition, two disjoint sets are sets whose intersection is the empty set.

In above image: e1, e2 is in `Set1`

and e3, e4, e5 is in `Set2`

.

There are two operations for a disjoint-set, `union`

and `find`

, this is the reason why disjoint-set also be called with the name `union-find`

:

make_set() => create a disjoin-set union(e1, e2) => link 'e1' and 'e2' to the same subset find(e) => return the subset's root for 'e'

If we union any two elements from two different sets, then they will be merged into one larger set.

For example, if e2 and e3 are unioned, which also means e1 and e5 in the same set.

With the result of `find`

, we can test whether two elements belong to the same set. If `find(e1)`

equals `find(e2)`

, this means e1 and e2 are in the same set.

This is a useful property of disjoin-set, especially for graph algorithms, in which we need to test whether two nodes have connectivity with each other.

Given the above specs and descriptions, how to implement this data structure by yourself?

You can spend several minutes trying it out by your hand, this is a superior way to study data structures and algorithms.

## Naive implementation with an array

Firstly, we need to find out use which basic data structure in programming languages to store data. If we could store the parent of one element, we could construct a tree-like structure to keep the relationship between elements.

Let’s take an example, if we finished operations of `union(e1, e2)`

, `union(e2, e3)`

, `union(e3, e4)`

, the constructed tree will be like this(suppose we always set the first parameter as a parent):

The function `find(e)`

will be implemented in recursive style, try to find parents from bottom to top, according to the tree.

```
#define MAX_SIZE 60
int parent[MAX_SIZE]; // the parent root for each node
void make_set() {
for(int i=0; i < MAX_SIZE; ++i )
parent[i] = i;
}
int find(int x) {
// recursive find and path compression
if(x != parent[x])
return find(parent[x]);
return parent[x];
}
void union_set(int e1, int e2) {
if( x == y ) return;
parent[e2] = e1; // e1 is always the parent
}
```

Ok, we finished the first version of `union-find`

with array!

Let’s run more test cases for verifying the correctness of our implementation, what happened if we run operations of `union(e1, e2)`

, `union(e3, e4)`

, `union(e2, e4)`

?

Wait! This is not what we expect! According to the definition above, we expect e1, e2, e3, e4 are in the same set. Seems we have a bug in the function of `union_set`

.

Yes! What we want to do is union two sets, not just the parent relationship of two elements. When we `union(e2, e4)`

, we need to find the root of e2 and e4 and finish it with an operation of `union(e1, e3)`

.

Let’s fix the bug in the first version:

```
void union_set(int e1, int e2) {
int root1 = find(e1);
int root2 = find(e2);
if( root1 == root2 ) return;
parent[root2] = root1; // e1's root is always the parent
}
```

Naive implementation’s time complexity is O(N).

## #1 Optimization with Path Compression

Let’s try to make it faster. In the naive recursive function of `find`

, if we cached the result in each recursive iteration, the time performance will be much better. This is the caching technique which explained in Caching: from bottom to top.

```
int find(int x) {
// recursive find with path compression
if(x != parent[x])
parent[x] = find(parent[x]);
return parent[x];
}
```

After a function call of `find`

, we can make the tree much flatter(the depth is shorter, which speeding up future operations on these elements):

## #2 Optimization with Union by Rank

In our first version of `union_set`

, we always set e1’s root as the parent of e2, what will happen after multiple times of union operations?

The depth of the tree will keep growing. Can we do some optimizations for this? A strategy called union-by-rank will help with this.

If we use another array to record the rank(the depth of an element’s subtree), then we always try to set the element with larger depth as the parent.

```
int rank[MAX_SIZE]; // the depath rank, initialized with 0 in make_set
void union_set(int e1, int e2) {
int x = find_set(e1);
int y = find_set(e2);
if( x == y ) return;
// optimization with rank
if(rank[x] > rank[y])
parent[y] = x;
else {
parent[x] = y;
if(rank[x] == rank[y]) ++rank[y];
}
}
```

Let’s take an example to understand this:

With the operation of `union by rank`

, when the depth of two roots is not equal, we can keep the depth of tree not to grow!

There is another strategy called `union by size`

, which always attaches the tree with fewer elements to the root of the tree with more elements.

The time analysis for the optimized version involves many mathematics tricks.

Let’s simply come out the conclusion, with path compression and union by rank, the amortized time complexity is only O(α(n)), where α(n) is the inverse Ackermann function, α(n) is less than 5 for any practical input size n.

It’s a really good time performance!

## Random Maze Generation

Finishing a small project with the newly learned data structure will do much help for your learning.

When I read Algorithms by Robert Sedgewick, there was an exercise to generate a random maze with disjoin-set.

It’s a nifty algorithm, the simplified code like this:

```
void rand_maze {
// begin with a rectangular maze of all closed cells
// numrows = number of rows of cells;
// numcols = number of columns of cells;
start = cell at (0,0);
goal = cell at (numrows-1, numcols-1);
numcells = numrows * numcols;
Partition p(numcells); // p represents the maze components
// goal is not reachable from start
while (!p.Find(start, goal)) {
edge = randomly select a wall;
x = edge.x;
y = edge.y;
if(!p.Find(x,y)) {
remove edge;
// x and y now in same component
p.Union(x,y);
}
}
}
```

We use disjoin-set to test whether `start`

and `goal`

have been connected with each other, randomly remove an edge in the map until they are connected.

A generated maze look like this, with start cell at left-up corner and goal at the right-bottom corner: