Performance of recursion vs accumulator style

We have two functions that compute the factorial of a given number. The first one, !, uses an accumulator style. The second, fact, uses natural recursion.

(define (! n0)
  (local (;; accumulator is the product of all natural numbers in [n0, n)
      (define (!-a n accumulator)
        (cond
          [(zero? n) accumulator]
          [else (!-a (sub1 n) (* n accumulator))])))
    (!-a n0 1)))

and

(define (fact n)
  (cond
    [(= 0 n) 1]
    [else (* n (fact (- n 1)))]))

At the bottom of Section 31, HtDP states that the naturally recursive version is often as fast if not faster than the accumulator version, but does not state the reasons why. I did some reading on this and it seems that the answer is 'tail call optimization/elimination', but the Wikipedia article seems to be at odds with what HtDP says, at least with respect to performance. Why is this so?

At work, the recursive style is faster. At home, the accumulator style is faster. Is there no general heuristic to guide a choice as to which style is generally preferred? I understand that the accumulator-style is more memory-efficient, but if we restrict the discussion to performance alone, it is unclear, at least to me, which is the better choice.

I've thought about this a little further, and would have to side with the Wikipedia article on the superiority of the accumulator-style recursion in the general case. Not only does it reduce usage of stack/heap space, memory access is always going to be behind register access and can only be made more evident now that multicore is here. Still, HtDP proves that actual testing is required in all cases.

Solution

The answer will depend on the details of the Racket system. Here's my take on it.

There are two major differences between the naturally recursive version and the accumulator version. First, the accumulator version is written in a fashion that allows tail-call optimization. This helps make the accumulator version faster, as fewer stack frames need to be created. But that is the opposite of what is discussed in HtDP and that you've seen on your work computer.

The other difference is the order of multiplication. The naturally recursive version multiples the numbers from 1 to 20 in ascending order, that is

((((1 * 2) * 3) * … * 19) * 20)

The accumulator version multiplies the same numbers in descending order, that is

((((20 * 19) * 18) * … * 2) * 1)

Mathematically, these are the same, and the two factorial functions will give the same result. Nonetheless, this difference can matter. In particular, at any intermediate multiplication, the intermediate result will be larger for the latter calculation than for the former calculation.

The factorial of 20 is a big number. It won't fit in a 32 bit integer. That means that racket will need to use an arbitrary length integer (a "bignum") to represent the answer, and some of the intermediate results. Arbitrary precision arithmetic, including multiplication involving bignums, is slower than fixed precision arithmetic.

Since the intermediate results in the accumulator version are always larger than for the naturally recursive version, the accumulator version will require a bignum earlier than the recursive version. In short, while both versions require the same number of multiplications, the accumulator version requires more arbitrary precision multiplications. This makes the accumulator version slower. Apparently, the additional cost of the arithmetic outweighs the saving from reducing the number of stack frames.

So why wouldn't the same trend show up on your home computer? You said it was an Intel iMac, so it is probably a 64 bit system. While 20! is a big number, it is small compared to what will fit in a 64 bit integer, so your home computer isn't doing any arbitrary precision arithmetic and the order doesn't matter. HtDP is old enough that it would have used a 32 bit system, as would Windows XP on your work computer.

More useful to explore the differences would be a function that calculates the product of a list of numbers, either

(define (product numlist)
  (* (car numlist) (product (cdr numlist)))

or an accumulator version. You could then feed in the numbers in either ascending or descending order, independent of whether you're using a naturally recursive or accumulator-based approach.