openmp python-c-api python-3.12

Python 3.12 C-API segfaults with OpenMP


Here is a small C++ program that embeds Python.

It works with Python 3.11.6, but segfaults with Python 3.12.0:

#include <iostream>
#include "omp.h"
#include "Python.h"

int main()
{
    Py_Initialize();
    
    #pragma omp parallel
    {
        #pragma omp single
        {
            std::cout << "One character:"<<std::endl;
            PyObject *nameobj1 = PyUnicode_FromString("a");
            std::cout << nameobj1 << std::endl;
            Py_DECREF(nameobj1);
            
            std::cout << "Two characters:"<<std::endl;
            PyObject *nameobj2 = PyUnicode_FromString("aa");
            std::cout << nameobj2 << std::endl;
            Py_DECREF(nameobj2);
        }
    }
    
    Py_Finalize();
}

Compiling and running with 3.11:

$ g++ pytest.cpp `python3.11-config --ldflags --cflags` -lpython3.11 -fopenmp
$ ./a.out 
One character:
0x730a12d466e0
Two characters:
0x730a121f33f0

Compiling and running with 3.12:

$ g++ pytest.cpp `python3.12-config --ldflags --cflags` -lpython3.12 -fopenmp
$ ./a.out 
One character:
0x734752e48a08
Two characters:
Segmentation fault (core dumped)

Has something changed in Python 3.12 that prevents using PyUnicode_FromString with more than one character under OpenMP? Is there a workaround?

Remarks:

  • g++ 13.2.0
  • 2 OpenMP threads
  • It works when compiled without -fopenmp
  • Here is a backtrace using gdb:
#0  0x00007ffff77f3f80 in _PyInterpreterState_GET () at ../Include/internal/pycore_pystate.h:118
#1  get_state () at ../Objects/obmalloc.c:866
#2  _PyObject_Malloc (ctx=<optimized out>, nbytes=43) at ../Objects/obmalloc.c:1563
#3  0x00007ffff782b509 in PyUnicode_New (maxchar=<optimized out>, size=2) at ../Objects/unicodeobject.c:1208
#4  PyUnicode_New (size=2, maxchar=<optimized out>) at ../Objects/unicodeobject.c:1154
#5  0x00007ffff7837081 in unicode_decode_utf8 (s=<optimized out>, size=2, error_handler=_Py_ERROR_UNKNOWN, errors=0x0, consumed=0x0)
    at ../Objects/unicodeobject.c:4647
#6  0x0000555555555422 in main._omp_fn.0(void) () at pytest.cpp:19
#7  0x00007ffff7f6b48e in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:129
#8  0x00007ffff6e97b5a in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#9  0x00007ffff6f285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Solution

  • Your code has a bug: it never acquires the GIL inside the child threads (the OpenMP worker threads). A thread must hold the GIL whenever it creates, deletes, or modifies a Python object (with a few exceptions for modification). The code was already wrong under Python 3.11; it just did not happen to crash there, whereas it does under Python 3.12.

    Some of the interpreter state is thread-local, and acquiring the GIL properly initializes that state for the calling thread. This matches your backtrace, which faults in _PyInterpreterState_GET while looking up that state.

    To acquire and release the GIL, use PyGILState_Ensure and PyGILState_Release respectively.

    You also need to release the GIL in the main thread before the parallel section to avoid deadlocks.

    I think the biggest change is the per-interpreter GIL added in Python 3.12, which moved more state into thread-local storage. That is what makes your code crash now; before this change the code was still wrong, but it did not crash.