data-structures tree time-complexity disjoint-sets union-find

How to correctly implement wighted union and path compression in a UnionFind data structure

I am trying to implement a Union-Find/Disjoint-Set data structure in C, using weighted Union and path compression in Find. I have some questions as to how the weighted union should be implemented to reduce time complexity when compared to the non weighted Union.

I have already found several solutions to this problem online and have implemented my own. In every solution, the root of each separate tree (representing a set) holds the number of nodes of the tree at all times. When uniting the sets of two random objects that belong to a different set, the roots are first found (path compression is used here) and then we compare the sizes of these roots. The root of the biggest tree is set as the parent of the of the root of the smallest tree.

In my understanding however, what we are trying to achieve with a weighted union is to reduce the height of the resulting tree (which is also what we are trying to achieve with path compression). Hence, it is no the tree with the lowest number of Nodes that should be connected to the other tree, but the tree with the lowest height. This keeps the height to a minimum.

Is this correct? Is checking the height and the size somehow equivalent given the rest of the implementation (we always start with a number of single (one node) sets)?

Supposing that it is the height that needs to be checked, keeping track of the height of a tree if path compression is not used is fairly straightforward. I have not however found a way to keep track of the height of the tree when path compression is used (at least not without traversing the whole tree, which increases the time complexity of the "find" algorithm.

Here is an example of an implementation I have found and uses what I described (very similar to what I have coded) in c++: https://github.com/kartikkukreja/blog-codes/blob/master/src/Union%20Find%20(Disjoint%20Set)%20Data%20Structure.cpp

Solution

It looks like you've pretty much figured this all out yourself.

Union-by-height is the obvious way to make the shortest tree, but it's hard to keep track of the height when you use path compression...

So union-by-rank is commonly used instead. The 'rank' of a set is what it's height would be if we didn't do any path compression, so when you use union-by-rank with path compression it's like starting with union-by-height and then applying path compression as an optimization, ensuring that the path compression doesn't change how the merges work.

A lot of people (myself included) use union-by-size, however, because the size is often useful, and it can be shown that union-by-size produces the same worst-case complexities as union-by-rank. Again in this case, path compression doesn't affect the merges since it doesn't change the size of any roots.