
Why does it take longer to execute a simple for loop in Python 3.12 than in Python 3.11?


Python 3.12 was released two days ago with several new features and improvements. It is claimed to be faster than ever, so I decided to give it a try. I ran a few of my scripts with the new version, but they were slower than before. I tried various approaches, simplifying my code each time in an attempt to identify the bottleneck that was causing it to run slowly, but I have been unsuccessful so far. Finally, I decided to test a simple for loop like the following:

import time

def calc():
    for i in range(100_000_000):
        x = i * 2

t = time.time()
calc()
print(time.time() - t)

On my machine, it took 4.7s on Python 3.11.5 and 5.7s on Python 3.12.0. Trying on other machines had similar results.

So why is it slower in the latest version of Python?


Solution

  • I am able to reproduce the observed behavior between CPython 3.11.2 and CPython 3.12.0rc2 on Debian Linux 6.1.0-6 using an Intel i5-9600KF CPU. I used a low-level profiling approach to find the differences. Put shortly: your benchmark is very specific and CPython 3.12 is less optimized for this specific case. CPython 3.12 seems to manage object allocations, and more specifically range, a bit differently. In particular, CPython 3.12 appears to create a new object from the constant 2 for every iteration of the loop, as opposed to CPython 3.11. Moreover, the main evaluation function does an indirect function pointer call which is particularly slow in this case. In any case, you should not use (C)Python for such a use case (this is stated in the CPython documentation).


    Under the hood

    Here are the results I get (pretty stable across multiple runs):

    3.11: 2.026395082473755
    3.12: 2.4122846126556396
    

    Thus, CPython 3.12 is roughly 20% slower than CPython 3.11 on my machine.
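
    If you want to reproduce such timings with a slightly more robust harness than time.time(), one possible sketch based on the standard timeit module is shown below (the figures above were measured with the original script, so this is only an alternative way to take the measurement):

    import timeit
    
    def calc():
        for i in range(100_000_000):
            x = i * 2
    
    # best of 3 runs, each running calc() exactly once
    print(min(timeit.repeat(calc, number=1, repeat=3)))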

    Profiling results indicate that half of the overhead comes from an indirect function pointer call in the main evaluation function of CPython 3.12 that was not present in CPython 3.11. Such a function call is expensive on most modern processors. Here is the assembly code of the hot part:

           │      ↓ je         4b8
      1,28 │        mov        (%rdi),%rax
      0,33 │        test       %eax,%eax
      0,61 │      ↓ js         4b8
      0,02 │        sub        $0x1,%rax
      2,80 │        mov        %rax,(%rdi)
           │      ↓ jne        4b8
      0,08 │        mov        0x8(%rdi),%rax
     16,28 │      → call       *0x30(%rax)            <---------------- HERE
           │        nop
      1,53 │ 4b8:   mov        (%rsp),%rax
           │        lea        0x4(%rax),%rcx
           │        movzbl     0x3(%rax),%eax
      0,06 │        mov        0x48(%r13,%rax,8),%rdx
      1,82 │        mov        (%rdx),%eax
      0,04 │        add        $0x1,%eax
           │      ↓ je         8800
    

    While the assembly code of the same function in CPython 3.11 is similar, it does not contain such an expensive call. Still, there are many similar indirect function calls already in CPython 3.11. My hypothesis is that such a call is more expensive in CPython 3.12 because it is less predictable by the hardware branch-prediction unit (possibly because the same instruction calls multiple functions). For more information about that, please read this great post. I cannot say much more about this part since the assembly code is really HUGE (and it turns out the C code is also pretty big).

    The rest of the overhead seems to come from the way objects, and more specifically constants, are managed in CPython 3.12. Indeed, in CPython 3.11, PyObject_Free calls are slow (because all the time is spent creating/deleting objects), while in CPython 3.12 such calls are not even visible in the profiler; instead, PyLong_FromLong shows up and is quite slow (it is not visible in CPython 3.11). The rest of the (many other) functions only take less than 25~30% of the time and look similar in the two versions. Based on that, we can conclude that CPython 3.12 creates a new object from the constant 2 for each iteration of the loop, as opposed to CPython 3.11. This is clearly not efficient in this case (one should keep in mind that CPython is an interpreter and not a compiler, so it is not surprising that it does not optimize this). There is a simple way to check that: store 2 in a variable before the loop and use this variable in the loop. Here is the corrected code:

    import time
    
    def calc():
        const = 2  # hoist the constant out of the loop body
        for i in range(100_000_000):
            x = i * const
    
    t = time.time()
    calc()
    print(time.time() - t)
    

    Here are the timings of the corrected code:

    3.11: 2.045902967453003
    3.12: 2.2230796813964844
    

    The CPython 3.12 version is now substantially faster than before, while the CPython 3.11 version is almost unaffected by the modification of the code. At first glance, this tends to confirm the last hypothesis. That being said, the profiler still reports many calls to PyLong_FromLong in the modified code! It turns out this change actually removed the issue related to the indirect function pointer call discussed at the beginning of this section!
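
    One rough way to poke at the constant-handling hypothesis from Python itself is to disassemble both variants with the standard dis module and compare how the 2 is loaded. This only shows the bytecode level, not the allocation behavior in the interpreter's C code, and the exact opcode names differ between 3.11 and 3.12; the function names below are only illustrative:

    import dis
    
    def calc_literal():
        for i in range(100_000_000):
            x = i * 2          # literal constant in the loop body
    
    def calc_hoisted():
        const = 2              # constant hoisted into a local variable
        for i in range(100_000_000):
            x = i * const
    
    dis.dis(calc_literal)
    print("-" * 40)
    dis.dis(calc_hoisted)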

    My hypothesis is that the PyLong_FromLong calls come from a different way of managing the objects generated by range (i.e. i). The following code tends to confirm that (note that the code requires ~4 GiB of RAM due to the list, so it should not be used in production but only for testing purposes):

    import time
    
    def calc():
        const = 2
        TMP = list(range(100_000_000))  # pre-build all the int objects (~4 GiB of RAM)
        t = time.time()
        for i in TMP:
            x = i * const
        print(time.time() - t)
    
    calc()
    

    Here are results on my machine:

    3.11: 1.6515681743621826
    3.12: 1.7162528038024902
    

    The gap is smaller than before and the loop timings are lower since the objects are all pre-computed in the list beforehand. Profiling results confirm that PyLong_FromLong is not called in the timed loop. Thus, range is slower in this case in CPython 3.12.

    The rest of the overhead is small (<4%). Such a performance gap can come from compiler optimizations or even very tiny changes in the CPython source code. For example, simple things like the address of conditional jumps can significantly impact performance on many Intel CPUs (see: JCC erratum). Tiny details like this matter and compilers are not perfect. This is why such a variation in performance is common and rather expected, so it is not worth investigating further.

    By the way, if you care about performance, then please use Cython or PyPy for compute-heavy code like this.
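
    For instance, here is a minimal sketch of the same kernel in Cython's pure-Python mode. This assumes Cython is installed and the file is compiled (e.g. with cythonize); the file name and types are only illustrative, and PyPy would run the original script unchanged:

    # calc_cy.py -- compile with: cythonize -i calc_cy.py
    import cython
    
    def calc() -> cython.long:
        i: cython.long
        x: cython.long = 0
        # with C integer types, the loop body avoids creating Python int objects
        for i in range(100_000_000):
            x = i * 2
        return x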