Can recursive union find be optimized?

When implementing union-find, I would usually write the find function with path compression like this:

def find(x):
    if x != par[x]:
        par[x] = find(par[x])
    return par[x]

This is easy to remember and arguably easy to read. This is also how many books and websites describe the algorithm.

However, naively compiled, this would use stack memory linear in the input size. In many languages and systems that would by default result in a stack overflow.

The only non-recursive way I know of writing find is this:

def find(x):
    p = par[x]
    while p != par[p]:
        p = par[p]
    while x != p:
        x, par[x] = par[x], p
    return p

It seems unlikely that many compilers would find that. (Maybe Haskell would?)

My question is in what cases it is safe to use the former version of find? If no widely used language can remove the recursion, shouldn't we tell people to use the iterative version? And might there be a simpler iterative implementation?


  • There seem to be two separate questions here.

    First - can optimizing compilers notice this and rewrite it? It's difficult to answer this question without testing all compilers and all versions. I tried this out using gcc 4.8.4 on the following code:

    size_t find(size_t uf[], size_t index) {
      if (index != uf[index]) {
        uf[index] = find(uf, uf[index]);
      return uf[index];
    void link(size_t uf[], size_t i, size_t j) {
      uf[find(uf, i)] = uf[find(uf, j)];

    This doesn't implement the union-by-rank optimization, but does support path compression. I compiled this using optimization level -O3 and the assembly is shown here:

        pushq   %r14
        .cfi_def_cfa_offset 16
        .cfi_offset 14, -16
        pushq   %r13
        .cfi_def_cfa_offset 24
        .cfi_offset 13, -24
        pushq   %r12
        .cfi_def_cfa_offset 32
        .cfi_offset 12, -32
        pushq   %rbp
        .cfi_def_cfa_offset 40
        .cfi_offset 6, -40
        pushq   %rbx
        .cfi_def_cfa_offset 48
        .cfi_offset 3, -48
        leaq    (%rdi,%rsi,8), %rbx
        movq    (%rbx), %rax
        cmpq    %rsi, %rax
        je  .L2
        leaq    (%rdi,%rax,8), %rbp
        movq    0(%rbp), %rdx
        cmpq    %rdx, %rax
        je  .L3
        leaq    (%rdi,%rdx,8), %r12
        movq    %rdx, %rax
        movq    (%r12), %rcx
        cmpq    %rcx, %rdx
        je  .L4
        leaq    (%rdi,%rcx,8), %r13
        movq    %rcx, %rax
        movq    0(%r13), %rdx
        cmpq    %rdx, %rcx
        je  .L5
        leaq    (%rdi,%rdx,8), %r14
        movq    %rdx, %rax
        movq    (%r14), %rsi
        cmpq    %rsi, %rdx
        je  .L6
        call    find           // <--- Recursion!
        movq    %rax, (%r14)
        movq    %rax, 0(%r13)
        movq    %rax, (%r12)
        movq    %rax, 0(%rbp)
        movq    %rax, (%rbx)
        popq    %rbx
        .cfi_def_cfa_offset 40
        popq    %rbp
        .cfi_def_cfa_offset 32
        popq    %r12
        .cfi_def_cfa_offset 24
        popq    %r13
        .cfi_def_cfa_offset 16
        popq    %r14
        .cfi_def_cfa_offset 8

    Given the existence of the recursive call in the middle, it doesn't look like this tail call was eliminated. In fairness, that's because the transformation you're describing is pretty nontrivial, so I'm not surprised it didn't find it. This doesn't mean that no optimizing compiler can find it, but that one major one won't.

    Your second question is why we present the algorithm this way. As someone who teaches both algorithms and programming, I think it's extremely valuable to discuss algorithms using a presentation that's as simple as possible, even if it means abstracting away some particular implementation details. Here, the key idea behind the algorithm is to update the parent pointers of all the nodes encountered up on the way to the representative. Recursion happens to be a pretty clean way of describing that idea, even though when implemented naively it risks a stack overflow. However, by expressing the pseudocode in that particular way, it's easier to describe and discuss it and to prove that it will work as advertised. We could describe it the other way to avoid a stack overflow, but in Theoryland we usually don't worry about details like that and the updated presentation, while more directly translatable into practice, would make it harder to see the key ideas.

    When looking at pseudocode for more advanced algorithms and data structures, it's common to omit critically important implementation details and to handwave that certain tasks are possible to do in certain time bounds. When discussing algorithms or data structures that build on top of even more complex algorithms and data structures, it often becomes impossible to write out pseudocode for everything because you have layers on top of layers on top of layers of glossed-over details. From a theoretical perspective, this is fine - if the reader does want to implement it, they can fill in the blanks. On the other hand, if the reader is more interested in the key techniques from the paper and the theory (which in academic settings is common), they won't get bogged down in implementation details.