python performance python-3.x python-internals

Why does max(iterable) perform much slower than an equivalent loop?

I noticed a strange performance hit from a minor refactoring that replaced a loop with a call to the builtin max inside a recursive function.

Here's the simplest reproduction I could produce:

import time

def f1(n):
    if n <= 1:
        return 1
    best = 0
    for k in (1, 2):
        current = f1(n-k)*n
        if current > best:
            best = current
    return best

def f2(n):
    if n <= 1:
        return 1
    return max(f2(n-k)*n for k in (1, 2))


t = time.perf_counter()
result1 = f1(30)
print('loop:', time.perf_counter() - t) # 0.45 sec

t = time.perf_counter()
result2 = f2(30)
print('max:', time.perf_counter() - t) # 1.02 sec

assert result1 == result2

Both f1 and f2 calculate factorial using the standard recursion but with an unnecessary maximization added in (just so I get to use max in a recursion while still keeping the recursion simple):

# pseudocode
factorial(0) = 1
factorial(1) = 1
factorial(n) = max(factorial(n-1)*n, factorial(n-2)*n)

It's implemented without memoization, so there's an exponential number of calls.

The implementation with max(iterable) is more than twice slower than the one with the loop.

~~Oddly, a direct comparison of max vs loop did not demonstrate the effect~~ (edit: never mind, see @TimPeters answer). Also, if I use max(a, b) instead of max(iterable) the performance mismatch disappears.

Solution

This is really unfair for the max function due to the generator expression you're feeding it.

For every invocation of f2 a new closure needs to be created for n, a new function needs to be made (that's the way generator expressions, and list expressions since Python 3 I believe, are implemented; see 'The Details' of PEP 289) that wraps up the code object representing the gen-exp. Then this function, which iteratively calls other functions, again gets called.

A tiny snippet of the byte-code in question:

14 LOAD_CLOSURE             0 (n)
16 BUILD_TUPLE              1
18 LOAD_CONST               2 (<code object <genexpr> at 0x7f1b667e1f60, file "", line 16>)
20 LOAD_CONST               3 ('f2.<locals>.<genexpr>')
22 MAKE_FUNCTION            8
24 LOAD_CONST               5 ((1, 2))
26 GET_ITER
28 CALL_FUNCTION            1

You, of course, don't see any instructions like these in f1's case since it is just doing calls.

When you then call your max function, f2, a significant number of times, as you're doing when recursively calculating the factorial of 30, the overhead just piles up.

A list comprehension version of the function thing pretty much suffers from the same issue. It is a bit faster because list comprehensions are faster than generator expressions.

If I use max(a, b) instead of max(iterable) the performance mismatch disappears.

Exactly, no functions are created for every call in this case so you're not seeing that overhead pile up. You're simply providing the arguments here.