Tags: python, performance, python-2.7, iteration, nested-function

Mapping a single function is slower than twice mapping two separate functions?


The following example seems to imply a run-time optimisation that I do not understand.

Can anyone explain this behaviour, and how it may apply to more general cases?

Example

Consider the following pair of simple example functions:

def y(x): # str output
    y = 1 if x else 0
    return str(y)

def _y(x): # no str
    y = 1 if x else 0
    return y

Assume I want to apply the function y to all elements in a list:

l = range(1000) # test input data

Result

A map operation has to iterate over every element in the list. It seems counter-intuitive that splitting the function apart into a double map significantly outperforms the single map:

%timeit map(str, map(_y, l))
1000 loops, best of 3: 206 µs per loop

%timeit map(y, l)
1000 loops, best of 3: 241 µs per loop

More generally, the same applies to nested functions outside the standard library, for example:

def f(x):
    return _y(_y(x))

%timeit map(_y, map(_y, l))
1000 loops, best of 3: 235 µs per loop
%timeit map(f, l)
1000 loops, best of 3: 294 µs per loop

Is this a Python overhead issue, where map uses compiled low-level code where possible and is consequently throttled when it has to interpret a nested function?


Solution

  • The difference is that map() is implemented in C code, and calling other C-implemented functions from it is cheap, while calling into Python code is expensive. On top of that, calling out to other callables from Python code is expensive too:

    >>> timeit.timeit('f(1)', 'def f(x): return str(x)')
    0.21682000160217285
    >>> timeit.timeit('str(1)')
    0.140916109085083
    

    Thirdly, you are passing the function objects to map() directly (so no further lookups are done), but y() has to look up the str name each and every time. Global lookups are relatively expensive compared to local lookups; binding a global to a function argument, making it a local, can help offset this a little:

    >>> timeit.timeit('f(1)', 'def f(x, _str=str): return _str(x)')
    0.19425392150878906
    

    That is much closer to the str(1) timing, even though that version had to use a global lookup too; str(1) still beats the function call hands-down if you give the timing test a local as well:

    >>> timeit.timeit('_str(1)', '_str = str')
    0.10266494750976562
    

    So, Python bytecode execution requires an additional object, a stack frame, to be created for each call. That stack frame object has to be managed on a dedicated Python call stack when calling out to other code. Moreover, your y function looks up the str name as a global each time, whereas the map(str, ...) call keeps a single reference to that object and uses it over and over again.

    By moving the str() call out of the y function and letting map() call str() directly instead through the single reference, you removed the stack handling and global name lookups, and sped things up marginally.
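
    The global lookup is visible directly in the compiled function objects: the str name ends up in y's co_names (the names the bytecode resolves as globals or attributes), while _y references no globals at all. A small sketch, using the same two functions as above:

    ```python
    import dis

    def y(x):  # calls str() via a global name lookup on every call
        y = 1 if x else 0
        return str(y)

    def _y(x):  # no global name lookups in the body
        y = 1 if x else 0
        return y

    # co_names lists the global/attribute names the bytecode references;
    # the local variable y lives in co_varnames instead.
    print(y.__code__.co_names)   # ('str',)
    print(_y.__code__.co_names)  # ()

    # The full disassembly shows the LOAD_GLOBAL for 'str' in y's body.
    dis.dis(y)
    ```

    The exact opcode names vary between interpreter versions, but the lookup for str appears in y's bytecode on every version, and never in _y's.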

    As a diagram, map(y, l) executes, per input value:

    • Create stackframe for y, execute the body
      • look up str as a global
        • push y stackframe onto stack
        • execute str(...)
        • pop stackframe from stack
      • return result

    whereas map(str, map(_y, l)) executes

    • Create stackframe for _y
      • return result
    • execute str(...)

    The same applies to your f() function setup:

    >>> def f(x):
    ...     return _y(_y(x))
    ...
    >>> timeit.timeit("map(_y, map(_y, l))", 'from __main__ import _y, testdata as l', number=10000)
    2.691640853881836
    >>> timeit.timeit("map(f, l)", 'from __main__ import f, testdata as l', number=10000)
    3.104063034057617
    

    Calling map() on _y twice is faster than nesting the _y(_y(x)) call in another function, which then has to do global name look-ups and stress the Python stack some more. In your f() example each map() iteration has to create 3 stack frames and push and pop them on and off the stack, while in your map(_y, map(_y, ...)) setup only 2 frames are created per iterated item:

    • Create stackframe for f, execute the body
      • look up _y as a global
        • push f stackframe onto stack
        • Create stackframe for _y, execute the body
        • pop stackframe from stack
      • look up _y as a global (yes, again)
        • push f stackframe onto stack
        • Create stackframe for _y, execute the body
        • pop stackframe from stack
      • return result

    versus:

    • Create stackframe for _y, execute the body
      • return result
    • Create stackframe for _y, execute the body
      • return result
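
    The frame counts in the two diagrams can be checked empirically with a profile hook: sys.setprofile reports a 'call' event for every Python stack frame that gets created, while C-level calls are reported separately as 'c_call' and ignored here. This is a Python 3 sketch (map() is lazy there, so the results are forced with list()):

    ```python
    import sys

    def _y(x):
        return 1 if x else 0

    def f(x):
        return _y(_y(x))

    def count_frames(thunk):
        """Run thunk() and count Python-level frame creations ('call' events)."""
        calls = [0]
        def profiler(frame, event, arg):
            if event == 'call':
                calls[0] += 1
        sys.setprofile(profiler)
        try:
            thunk()
        finally:
            sys.setprofile(None)
        return calls[0]

    l = range(100)

    # map(f, l): one frame for f plus two for _y, per element.
    nested = count_frames(lambda: list(map(f, l)))
    # map(_y, map(_y, l)): only two _y frames per element, no wrapper frame.
    double = count_frames(lambda: list(map(_y, map(_y, l))))

    print(nested - double)  # 100: one extra f frame per element
    ```

    The difference is exactly one frame per input element, which is the wrapper frame for f that the double-map version never has to create.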

    Again, binding _y to a local can offset the difference a little:

    >>> def f(x, _y=_y):
    ...     return _y(_y(x))
    ...
    >>> timeit.timeit("map(f, l)", 'from __main__ import f, testdata as l', number=10000)
    2.981696128845215
    

    but that extra Python frame object is still hobbling the single map(f, ...) call.


    TL;DR: Compared to the double map() version, your y() function suffers from O(N) additional global name lookups and O(N) additional stack frame objects being pushed onto and popped off the Python stack.

    If speed matters this much, avoid creating Python stack frames and performing global name lookups in a tight loop.
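
    To reproduce the comparison on a modern Python 3, where map() is lazy and must be wrapped in list() to force the iteration, a minimal timing sketch might look like this (absolute numbers and the size of the gap vary by interpreter version):

    ```python
    from timeit import timeit

    def y(x):   # single function: global str lookup on every call
        return str(1 if x else 0)

    def _y(x):  # str moved out, applied by the outer map instead
        return 1 if x else 0

    l = range(1000)

    # Both spellings produce identical output...
    assert list(map(y, l)) == list(map(str, map(_y, l)))

    # ...but the double map avoids one global lookup per element.
    t_single = timeit(lambda: list(map(y, l)), number=1000)
    t_double = timeit(lambda: list(map(str, map(_y, l))), number=1000)
    print(t_single, t_double)
    ```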