Disjoint set / Union-Find implemented in Python

Edward Erelt
4 min readMay 15, 2021

--

Consider the following scenario: We have a list of people and the names of the people this person is connected to. This type of data is the basis for social networks. For example a list of people in Facebook and their friends or a list of accounts in Twitter and their followers.

Are Mary and Jack indirectly connected?

It is easy to check whether two people are directly connected. What we would also like to know however, is whether two people are connected indirectly via other people — are they part of the same group (set). For example person A is friends with B and person B is friends with C, so A and C are in the same set. This is useful for many applications. For example advertising — people A and C may have similar purchasing habits since they belong to the same social group.

A disjoint-set data structure keeps track of which set an object is in. The union-find algorithm allows to efficiently perform two operations:
Find: Find the set an object is in. Input to this function is an object in the disjoint-set. This can be used to check whether two objects are in the same set.
Union: Connect two objects — if neither object is in a set this means creating a new set. If only one object is in a set then this means adding the other to the set. If both are in sets then it means merging two sets into one.

The following API specifies the Union-find class:

We use a site-indexed list id[] as the basic data structure to represent the members in our disjoint-set. The API already specifies that members will be identified by integer values between 0 and N-1. Each index in the list id[] represents one member in our disjoint-set. Each value represents one connection of that member to another member (possibly to itself). For example if id=[1,1,3,3,1], then members 0, 1 and 4 are connected to 1 while members 2 and 3 are connected to 3. Initially we start with N sets so we initialise id[i] to i from 0 to N-1.

All of our implementations use one-line implementation of connected() that return boolean value find(p)==find(q).

Our starting point for Union-Find is given below. We maintain two variables — the count of sets as an integer count and the list of members id[]. The implementation of methods find() and union() are the topic for the rest of the article.

Quick-find

The first and easiest approach is to maintain the same value in id[] for each set. Members p and q are in the same set if and only if id[p]==id[q]. In that case find(p) reduces to returning the value id[p] in constant time.

For union(p, q) we first check whether p and q are already in the same set. If they are then there is nothing more for us to do. However, if they are not then we will have to change all entries in the list id[] that equal id[q] to id[p]. To do that we will have to iterate over the whole list in time proportional to N. Notice that we could have equivalently decided to change all entries equal to id[p] to id[q].

The find() method in quick-find is certainly very quick. But quick-find is not useful for large dynamic problems, where we need to add new connections between members since the union() method needs to iterate through whole list.

Quick-union

Quick-union speeds up the union() operation. It uses same underlying id[] list data structure but interprets the values differently. Each entry in id[] is now the name of another member in the set — which we refer to as link. To implement find() we start at the given member, follow the link to another member, then to another member and so on until we reach the root, a member that has a link to itself. Two members are in the same set only if their root is the same member.

To implement union(p, q) we follow the links of p and q to their roots and then link the roots together by setting the root of q to be p. Again, we could equivalently set the root of p to be q. The union() uses time proportional to find() hence the name quick-union.

Quick-union is usually much faster than quick-find since it does not iterate through the whole data for every union(). However, the cost of union() depends on the nature of the inputs. In the worst case, when running union(2, 1), union(3, 2) … union(N-1, N-2) in sequence, the resulting data structure requires the find(p) to iterate through the list from p to N-1 in order to reach the root. We will solve this problem with quick-union.

Weighted Quick-union

Luckily, there is an easy way to ensure the tree does not grow taller than log2(N). Instead of arbitrarily connecting the roots of p and q, we keep track of the size of each tree and always connect the root of the smaller tree to the root of the larger tree. This change requires a bit of extra code and for us to keep an extra list size[] for the size of each tree, but leads to significant improvement in performance.

Summary

Weighted quick-union completes 2,000,000 union operations in 5 seconds on a laptop. This is sufficiently fast for most real-life scenarios.

The code is available at:

--

--

Edward Erelt
0 Followers

Coder and entrepreneur, AI and deep learning optimist