Why is Python 3.13.0b3 (compiled with the GIL disabled) slower than 3.12.0?

I ran a simple performance test comparing Python 3.12.0 with Python 3.13.0b3 built with the --disable-gil flag. The program computes Fibonacci numbers using either ThreadPoolExecutor or ProcessPoolExecutor. PEP 703, which introduced the GIL-free build, says there is some overhead, mostly due to biased reference counting, followed by per-object locking (https://peps.python.org/pep-0703/#performance). However, it puts the overhead on the pyperformance benchmark suite at around 5-8%. My simple benchmark shows a much larger difference. Python 3.13 without the GIL does utilize all CPUs with a ThreadPoolExecutor, but it is much slower than Python 3.12 with the GIL. Judging by the CPU utilization and the elapsed time, Python 3.13 spends several times more clock cycles on the same work than 3.12.

Program code:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import datetime
from functools import partial
import sys
import logging
import multiprocessing

logging.basicConfig(
    format='%(levelname)s: %(message)s',
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
cpus = multiprocessing.cpu_count()
pool_executor = ProcessPoolExecutor if len(sys.argv) > 1 and sys.argv[1] == '1' else ThreadPoolExecutor
python_version_str = f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}'
logger.info(f'Executor={pool_executor.__name__}, python={python_version_str}, cpus={cpus}')


def fibonacci(n: int) -> int:
    if n < 0:
        raise ValueError("Incorrect input")
    elif n == 0:
        return 0
    elif n == 1 or n == 2:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

start = datetime.datetime.now()

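# The pool size is fixed at 8 workers, regardless of the number of CPUs.
with pool_executor(8) as executor:
    for _ in range(30):
        executor.submit(partial(fibonacci, 30))
    # Leaving the with-block waits for all submitted tasks to finish,
    # so no explicit executor.shutdown(wait=True) is needed.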

end = datetime.datetime.now()
elapsed = end - start
logger.info(f'Elapsed: {elapsed.total_seconds():.2f} seconds')

Test results:

# TEST Linux 5.15.0-58-generic, Ubuntu 20.04.6 LTS

INFO: Executor=ThreadPoolExecutor, python=3.12.0, cpus=2
INFO: Elapsed: 10.54 seconds

INFO: Executor=ProcessPoolExecutor, python=3.12.0, cpus=2
INFO: Elapsed: 4.33 seconds

INFO: Executor=ThreadPoolExecutor, python=3.13.0b3, cpus=2
INFO: Elapsed: 22.48 seconds

INFO: Executor=ProcessPoolExecutor, python=3.13.0b3, cpus=2
INFO: Elapsed: 22.03 seconds

Can anyone explain why I see such a large difference compared with the overhead reported for the pyperformance benchmark suite?

EDIT 1

I tried pool_executor(cpus) instead of pool_executor(8) and still got similar results.
I also watched this video, https://www.youtube.com/watch?v=zWPe_CUR4yU, and ran the following test: https://github.com/ArjanCodes/examples/blob/main/2024/gil/main.py

Results:

Version of python: 3.12.0a7 (main, Oct  8 2023, 12:41:37) [GCC 9.4.0]
GIL cannot be disabled
Single-threaded: 78498 primes in 6.67 seconds
Threaded: 78498 primes in 7.89 seconds
Multiprocessed: 78498 primes in 5.85 seconds

Version of python: 3.13.0b3 experimental free-threading build (heads/3.13.0b3:7b413952e8, Jul 27 2024, 11:19:31) [GCC 9.4.0]
GIL is disabled
Single-threaded: 78498 primes in 61.42 seconds
Threaded: 78498 primes in 32.29 seconds
Multiprocessed: 78498 primes in 39.85 seconds

So this is yet another test on my machine where the free-threading build ends up several times slower. By the way, the video shows overhead similar to what the PEP describes.

EDIT 2

As @ekhumoro suggested, I configured the build with the following flags:
./configure --disable-gil --enable-optimizations
and it seems the --enable-optimizations flag makes a significant difference in these benchmarks. The previous build was configured with:
./configure --with-pydebug --disable-gil

Test results:

Fibonacci benchmark:

INFO: Executor=ThreadPoolExecutor, python=3.12.0, cpus=2
INFO: Elapsed: 10.25 seconds

INFO: Executor=ProcessPoolExecutor, python=3.12.0, cpus=2
INFO: Elapsed: 4.27 seconds

INFO: Executor=ThreadPoolExecutor, python=3.13.0, cpus=2
INFO: Elapsed: 6.94 seconds

INFO: Executor=ProcessPoolExecutor, python=3.13.0, cpus=2
INFO: Elapsed: 6.94 seconds

Prime numbers benchmark:

Version of python: 3.12.0a7 (main, Oct  8 2023, 12:41:37) [GCC 9.4.0]
GIL cannot be disabled
Single-threaded: 78498 primes in 5.77 seconds
Threaded: 78498 primes in 7.21 seconds
Multiprocessed: 78498 primes in 3.23 seconds

Version of python: 3.13.0b3 experimental free-threading build (heads/3.13.0b3:7b413952e8, Aug  3 2024, 14:47:48) [GCC 9.4.0]
GIL is disabled
Single-threaded: 78498 primes in 7.99 seconds
Threaded: 78498 primes in 4.17 seconds
Multiprocessed: 78498 primes in 4.40 seconds

So the general gain from moving from Python 3.12 multiprocessing to Python 3.13 no-GIL multi-threading is a significant memory saving (there is only a single process).

Comparing the elapsed-time overhead on this machine with only 2 cores:

[Fibonacci] Python 3.13 multi-threading against Python 3.12 multiprocessing: (6.94 - 4.27) / 4.27 * 100% ~= 63% overhead

[Prime numbers] Python 3.13 multi-threading against Python 3.12 multiprocessing: (4.17 - 3.23) / 3.23 * 100% ~= 29% overhead
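The two percentages above can be reproduced with a small helper. This is only a sketch: the function name overhead_percent is made up for illustration, and the timings are copied from the optimized-build results in this edit.

```python
def overhead_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative slowdown of candidate versus baseline, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return (candidate_s - baseline_s) / baseline_s * 100.0


# Baseline: Python 3.12 multiprocessing. Candidate: Python 3.13 multi-threading.
fib = overhead_percent(4.27, 6.94)
primes = overhead_percent(3.23, 4.17)

print(f"Fibonacci: {fib:.0f}% overhead")
print(f"Prime numbers: {primes:.0f}% overhead")
```

Note that these are single-run wall-clock numbers, so the percentages are rough indications rather than stable measurements.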

Solution

From the latest question edits, it seems the Python 3.13 build used for testing had debug mode enabled (--with-pydebug) and optimisations disabled (no --enable-optimizations). The former in particular can have a large impact on performance testing, whilst the latter will have a much smaller, but still significant, impact. In general, it's best to avoid drawing any conclusions about performance when testing with development builds of Python.
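Before benchmarking, it may help to check how the interpreter was built. The sketch below is one possible way to do that, not an official recipe. The helper name describe_build is hypothetical. It relies on sys.gettotalrefcount existing only in debug builds, on the Py_GIL_DISABLED and CONFIG_ARGS build variables from sysconfig, and on sys._is_gil_enabled(), which is only available from Python 3.13. CONFIG_ARGS may be missing on some platforms (Windows, for example), in which case the optimisation check simply reports False.

```python
import sys
import sysconfig


def describe_build() -> dict:
    """Collect a few build properties that matter for benchmarking."""
    # CONFIG_ARGS holds the ./configure arguments; it can be None on some platforms.
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    # sys._is_gil_enabled() was added in 3.13; older versions always have the GIL.
    gil_check = getattr(sys, "_is_gil_enabled", None)
    return {
        # sys.gettotalrefcount is only defined in debug (--with-pydebug) builds.
        "debug_build": hasattr(sys, "gettotalrefcount"),
        # Py_GIL_DISABLED is 1 for free-threaded (--disable-gil) builds.
        "free_threaded_build": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
        "gil_enabled_now": gil_check() if gil_check is not None else True,
        "optimizations_requested": "--enable-optimizations" in config_args,
        "configure_args": config_args,
    }


if __name__ == "__main__":
    for key, value in describe_build().items():
        print(f"{key}: {value}")
```

If debug_build is True, timings from that interpreter should not be compared against a release build.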