Tags: loops, recursion, stack-overflow, tail-recursion, union-find

Can recursive union find be optimized?


When implementing union-find, I would usually write the find function with path compression like this:

def find(x):
    if x != par[x]:
        par[x] = find(par[x])  # path compression: point x straight at the root
    return par[x]

This is easy to remember and arguably easy to read. This is also how many books and websites describe the algorithm.

However, naively compiled, this uses stack space proportional to the depth of the tree, which can be linear in the input size. In many languages and runtimes, that would by default result in a stack overflow.
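In Python, for example, this is easy to demonstrate. Here is a minimal sketch (the chain length of 10,000 is arbitrary; CPython's default recursion limit is about 1,000):

import sys

n = 10_000
par = list(range(n + 1))
for i in range(1, n + 1):
    par[i] = i - 1          # worst-case input: one long chain

def find(x):
    if x != par[x]:
        par[x] = find(par[x])
    return par[x]

print("recursion limit:", sys.getrecursionlimit())
try:
    find(n)
except RecursionError:
    print("stack overflow on a chain of depth", n)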

The only non-recursive way I know of writing find is this:

def find(x):
    # First pass: walk up to the root.
    p = par[x]
    while p != par[p]:
        p = par[p]
    # Second pass: point every node on the path at the root.
    # The target order matters here: Python assigns tuple targets
    # left to right, so par[x] must be written before x is advanced.
    while x != p:
        par[x], x = p, par[x]
    return p
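As a quick sanity check (an ad-hoc example, with par and find as above):

par = [0, 0, 1, 2]          # a chain: 3 -> 2 -> 1 -> 0
assert find(3) == 0
assert par == [0, 0, 0, 0]  # every node now points directly at the root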

It seems unlikely that many compilers would find that. (Maybe Haskell would?)

My question is: in what cases is it safe to use the recursive version of find? If no widely used language can remove the recursion, shouldn't we tell people to use the iterative version? And might there be a simpler iterative implementation?


Solution

  • There seem to be two separate questions here.

    First - can optimizing compilers notice this and rewrite it? It's difficult to answer definitively without testing every compiler and every version, but as one data point, I tried gcc 4.8.4 on the following code:

    #include <stddef.h>  /* for size_t */

    /* find with path compression, mirroring the Python version */
    size_t find(size_t uf[], size_t index) {
      if (index != uf[index]) {
        uf[index] = find(uf, uf[index]);
      }
      return uf[index];
    }

    /* naive union (no union by rank): hang i's root under j's root */
    void link(size_t uf[], size_t i, size_t j) {
      uf[find(uf, i)] = uf[find(uf, j)];
    }
    

    This doesn't implement the union-by-rank optimization, but it does do path compression. I compiled it at optimization level -O3 (with something like gcc -O3 -S) and the resulting assembly is shown here:

     find:
    .LFB23:
        .cfi_startproc
        pushq   %r14
        .cfi_def_cfa_offset 16
        .cfi_offset 14, -16
        pushq   %r13
        .cfi_def_cfa_offset 24
        .cfi_offset 13, -24
        pushq   %r12
        .cfi_def_cfa_offset 32
        .cfi_offset 12, -32
        pushq   %rbp
        .cfi_def_cfa_offset 40
        .cfi_offset 6, -40
        pushq   %rbx
        .cfi_def_cfa_offset 48
        .cfi_offset 3, -48
        leaq    (%rdi,%rsi,8), %rbx
        movq    (%rbx), %rax
        cmpq    %rsi, %rax
        je  .L2
        leaq    (%rdi,%rax,8), %rbp
        movq    0(%rbp), %rdx
        cmpq    %rdx, %rax
        je  .L3
        leaq    (%rdi,%rdx,8), %r12
        movq    %rdx, %rax
        movq    (%r12), %rcx
        cmpq    %rcx, %rdx
        je  .L4
        leaq    (%rdi,%rcx,8), %r13
        movq    %rcx, %rax
        movq    0(%r13), %rdx
        cmpq    %rdx, %rcx
        je  .L5
        leaq    (%rdi,%rdx,8), %r14
        movq    %rdx, %rax
        movq    (%r14), %rsi
        cmpq    %rsi, %rdx
        je  .L6
        call    find           // <--- Recursion!
        movq    %rax, (%r14)
    .L6:
        movq    %rax, 0(%r13)
    .L5:
        movq    %rax, (%r12)
    .L4:
        movq    %rax, 0(%rbp)
    .L3:
        movq    %rax, (%rbx)
    .L2:
        popq    %rbx
        .cfi_def_cfa_offset 40
        popq    %rbp
        .cfi_def_cfa_offset 32
        popq    %r12
        .cfi_def_cfa_offset 24
        popq    %r13
        .cfi_def_cfa_offset 16
        popq    %r14
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
    

    Given the recursive call in the middle, the recursion clearly wasn't eliminated; gcc inlined the first few levels (the chain of compare-and-branch blocks above) and then fell back to a genuine call. In fairness, the call isn't a true tail call - its result still has to be written back into each uf[index] on the way out - and the transformation you're describing is pretty nontrivial, so I'm not surprised gcc didn't find it. This doesn't mean that no optimizing compiler can find it, but it does mean that one major one won't.
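    To get a feel for what the compiler would have had to synthesize, here is the same transformation written out by hand - a sketch in Python, using an explicit list in place of the eliminated call stack, equivalent in effect to the two-pass loop from the question:

    def find(x):
        # First pass: walk to the root, recording the path
        # (the list plays the role of the call stack).
        path = []
        while x != par[x]:
            path.append(x)
            x = par[x]
        # Second pass: point every recorded node at the root.
        for node in path:
            par[node] = x
        return x

    Unlike the two-pointer loop, this keeps the path in heap memory rather than constant extra space, but it avoids unbounded stack growth entirely.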

    Your second question is why we present the algorithm this way. As someone who teaches both algorithms and programming, I think it's extremely valuable to discuss algorithms using a presentation that's as simple as possible, even if it means abstracting away some implementation details. Here, the key idea behind the algorithm is to update the parent pointers of all the nodes encountered on the way up to the representative. Recursion happens to be a pretty clean way of describing that idea, even though when implemented naively it risks a stack overflow. By expressing the pseudocode in that particular way, it's easier to describe and discuss the algorithm and to prove that it works as advertised. We could present it the other way to avoid a stack overflow, but in Theoryland we usually don't worry about details like that, and that presentation, while more directly translatable into practice, would make it harder to see the key ideas.

    When looking at pseudocode for more advanced algorithms and data structures, it's common to omit critically important implementation details and to hand-wave that certain tasks can be done within certain time bounds. When discussing algorithms or data structures that build on even more complex algorithms and data structures, it often becomes impossible to write out pseudocode for everything, because you have layers upon layers of glossed-over details. From a theoretical perspective, this is fine: if the reader wants to implement it, they can fill in the blanks. And if the reader is more interested in the key techniques and the theory (which is common in academic settings), they won't get bogged down in implementation details.