c gcc recursion optimization tail-call-optimization

How 'smart' is GCC's Tail-Call-Optimisation?

I just had a discussion where the following two peices of C code were being discussed:

For-Loop:

#include <stdio.h>
#define n (196607)

int main() {
  long loop;
  int count=0;
  for (loop=0;loop<n;loop++) {
    count++;
  }
  printf("Result = %d\n",count);

  return 0;
}

Recursive:

#include <stdio.h>
#define n (196607)

long recursive(long loop) {
  return (loop>0) ? recursive(loop-1)+1: 0;
}

int main() {
  long result;
  result = recursive(n);
  printf("Result = %d\n",result);
  return 0;
}

On seeing this code, I saw recursive(loop-1)+1 and thought "ah, that's not tail call recursive" because it has work to do after the call to recursive is complete; it needs to increment the return value.

Sure enough, with no optimisation, the recursive code triggers a stack overflow, as you would expect.

with the -O2 flag however, the stack overflow is not encountered, which I take to mean that the stack is reused, rather than pushing more and more onto the stack - which is tco.

GCC can obviously detect this simple case (+1 to return value) and optimise it out, but how far does it go?

What are the limits to what gcc can optimise with tco, when the recursive call isn't the last operation to be performed?

Addendum: I've written a fully tail recursive return function(); version of the code. Wrapping the above in a loop with 9999999 iterations, I came up with the following timings:

$ for f in *.exe; do time ./$f > results; done
+ for f in '*.exe'
+ ./forLoop.c.exe

real    0m3.650s
user    0m3.588s
sys     0m0.061s
+ for f in '*.exe'
+ ./recursive.c.exe

real    0m3.682s
user    0m3.588s
sys     0m0.093s
+ for f in '*.exe'
+ ./tail_recursive.c.exe

real    0m3.697s
user    0m3.588s
sys     0m0.077s

so a (admittedly simple and not very rigorous) benchmark shows that it does indeed seem to be in the same order of time taken.

Solution

Just disassemble the code and see what happened. Without optimizations, I get this:

0x0040150B  cmpl   $0x0,0x10(%rbp)
0x0040150F  jle    0x401523 <recursive+35>
0x00401511  mov    0x10(%rbp),%eax
0x00401514  sub    $0x1,%eax
0x00401517  mov    %eax,%ecx
0x00401519  callq  0x401500 <recursive>

But with -O1, -O2 or -O3 I get this:

0x00402D09  mov    $0x2ffff,%edx

This doesn't have anything to do with tail optimizations, but much more radical optimizations. The compiler simply inlined the whole function and pre-calculated the result.

This is likely why you end up with the same result in all your different cases of benchmarking.