I noticed a strange performance hit from a minor refactoring that replaced a loop with a call to the builtin max
inside a recursive function.
Here's the simplest reproduction I could produce:
import time
def f1(n):
if n <= 1:
return 1
best = 0
for k in (1, 2):
current = f1(n-k)*n
if current > best:
best = current
return best
def f2(n):
if n <= 1:
return 1
return max(f2(n-k)*n for k in (1, 2))
t = time.perf_counter()
result1 = f1(30)
print('loop:', time.perf_counter() - t) # 0.45 sec
t = time.perf_counter()
result2 = f2(30)
print('max:', time.perf_counter() - t) # 1.02 sec
assert result1 == result2
Both f1
and f2
calculate factorial using the standard recursion but with an unnecessary maximization added in (just so I get to use max
in a recursion while still keeping the recursion simple):
# pseudocode
factorial(0) = 1
factorial(1) = 1
factorial(n) = max(factorial(n-1)*n, factorial(n-2)*n)
It's implemented without memoization, so there's an exponential number of calls.
The implementation with max(iterable)
is more than twice slower than the one with the loop.
Oddly, a direct comparison of (edit: never mind, see @TimPeters answer). Also, if I use max
vs loop did not demonstrate the effectmax(a, b)
instead of max(iterable)
the performance mismatch disappears.
This is really unfair for the max
function due to the generator expression you're feeding it.
For every invocation of f2
a new closure needs to be created for n
, a new function needs to be made (that's the way generator expressions, and list expressions since Python 3 I believe, are implemented; see 'The Details' of PEP 289) that wraps up the code object representing the gen-exp. Then this function, which iteratively calls other functions, again gets called.
A tiny snippet of the byte-code in question:
14 LOAD_CLOSURE 0 (n)
16 BUILD_TUPLE 1
18 LOAD_CONST 2 (<code object <genexpr> at 0x7f1b667e1f60, file "", line 16>)
20 LOAD_CONST 3 ('f2.<locals>.<genexpr>')
22 MAKE_FUNCTION 8
24 LOAD_CONST 5 ((1, 2))
26 GET_ITER
28 CALL_FUNCTION 1
You, of course, don't see any instructions like these in f1
's case since it is just doing calls.
When you then call your max
function, f2
, a significant number of times, as you're doing when recursively calculating the factorial of 30
, the overhead just piles up.
A list comprehension version of the function thing pretty much suffers from the same issue. It is a bit faster because list comprehensions are faster than generator expressions.
If I use
max(a, b)
instead ofmax(iterable)
the performance mismatch disappears.
Exactly, no functions are created for every call in this case so you're not seeing that overhead pile up. You're simply providing the arguments here.