Search code examples
cmultithreadingpthreadsdeadlockglibc

How can I find out the source of this glibc backtrace originating with clone()?


This backtrace comes from a deadlock situation in a multi-threaded application. The other deadlocked threads are locking inside the call to malloc(), and appear to be waiting on this thread.

I don't understand what creates this thread, since it deadlocks before calling any functions in my application:

Thread 6 (Thread 0x7ff69d43a700 (LWP 14191)):
#0  0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00007ff6a299460d in _L_lock_27 () from /usr/lib64/libc.so.6
#2  0x00007ff6a29945bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3  0x00007ff6a2994662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4  0x00007ff6a3875e38 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6

clone() is used to implement fork(), pthread_create(), and perhaps other functions. See here and here.

How can I find out if this trace comes from a fork(), pthread_create(), a signal handler, or something else? Do I just need to dig through the glibc code, or can I use gdb or some other tool? Why does this thread need the internal glibc lock? This would be useful in determining the cause of the deadlock.

Additional information and research:

malloc() is thread-safe, but not reentrant (recursive-safe) (see this and this, so malloc() is also not async-signal-safe. We don't define signal handlers for this process, so I know that we don't call malloc() from signal handlers. The deadlocked threads don't ever call recursive functions, and callbacks are handled in a new thread, so I don't think we should need to worry about reentrancy here. (Maybe I'm wrong?)

This deadlock happens when many callbacks are being spawned to signal (ultimately kill) different processes. The callbacks are spawned in their own threads.

Are we possibly using malloc in an unsafe way?

Possibly related:

glibc malloc internals

Malloc inside of signal handler causes deadlock.

How are signal handlers delivered in a multi-threaded application?

glibc fork/malloc deadlock bug that was fixed in glibc-2.17-162.el7. This looks similar, but is NOT my bug - I'm on a fixed version of glibc.

(I've been unsuccessful in creating a minimal, complete, verifiable example. Unfortunately the only way to reproduce is with the application (Slurm), and it's quite difficult to reproduce.)

EDIT: Here's the backtrace from all the threads. Thread 6 is the trace I originally posted. Thread 1 is just waiting on a pthread_join(). Threads 2-5 are locked after a call to malloc(). Thread 7 is listening for messages and spawning callbacks in new threads (threads 2-5). Those would be callbacks that would eventually signal other processes.

Thread 7 (Thread 0x7ff69e672700 (LWP 12650)):
#0  0x00007ff6a291aa3d in poll () from /usr/lib64/libc.so.6
#1  0x00007ff6a3c09064 in _poll_internal (shutdown_time=<optimized out>, nfds=2,
    pfds=0x7ff6980009f0) at ../../../../slurm/src/common/eio.c:364
#2  eio_handle_mainloop (eio=0xf1a970) at ../../../../slurm/src/common/eio.c:328
#3  0x000000000041ce78 in _msg_thr_internal (job_arg=0xf07760)
    at ../../../../../slurm/src/slurmd/slurmstepd/req.c:245
#4  0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6

Thread 6 (Thread 0x7ff69d43a700 (LWP 14191)):
#0  0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00007ff6a299460d in _L_lock_27 () from /usr/lib64/libc.so.6
#2  0x00007ff6a29945bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3  0x00007ff6a2994662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4  0x00007ff6a3875e38 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6

Thread 5 (Thread 0x7ff69e773700 (LWP 22471)):
#0  0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4  0x00007ff6a3c02e60 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false,
    file=file@entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
    line=line@entry=152, func=func@entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
    at ../../../../slurm/src/common/xmalloc.c:86
#5  0x00007ff6a3b2e5b7 in init_buf (size=16384)
    at ../../../../slurm/src/common/pack.c:152
#6  0x000000000041caab in _handle_accept (arg=0x0)
    at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7  0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x7ff6a4086700 (LWP 5633)):
#0  0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4  0x00007ff6a3c02e60 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false,
    file=file@entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
    line=line@entry=152, func=func@entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
    at ../../../../slurm/src/common/xmalloc.c:86
#5  0x00007ff6a3b2e5b7 in init_buf (size=16384)
    at ../../../../slurm/src/common/pack.c:152
#6  0x000000000041caab in _handle_accept (arg=0x0)
    at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7  0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x7ff69d53b700 (LWP 12963)):
#0  0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4  0x00007ff6a3c02e60 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false,
    file=file@entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
    line=line@entry=152, func=func@entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
    at ../../../../slurm/src/common/xmalloc.c:86
#5  0x00007ff6a3b2e5b7 in init_buf (size=16384)
    at ../../../../slurm/src/common/pack.c:152
#6  0x000000000041caab in _handle_accept (arg=0x0)
    at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7  0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x7ff69f182700 (LWP 19734)):
#0  0x00007ff6a2932eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00007ff6a28af7d8 in _L_lock_1579 () from /usr/lib64/libc.so.6
#2  0x00007ff6a28a7ca0 in arena_get2.isra.3 () from /usr/lib64/libc.so.6
#3  0x00007ff6a28ad0fe in malloc () from /usr/lib64/libc.so.6
#4  0x00007ff6a3c02e60 in slurm_xmalloc (size=size@entry=24, clear=clear@entry=false,
    file=file@entry=0x7ff6a3c1f1f0 "../../../../slurm/src/common/pack.c",
    line=line@entry=152, func=func@entry=0x7ff6a3c1f4a6 <__func__.7843> "init_buf")
    at ../../../../slurm/src/common/xmalloc.c:86
#5  0x00007ff6a3b2e5b7 in init_buf (size=16384)
    at ../../../../slurm/src/common/pack.c:152
#6  0x000000000041caab in _handle_accept (arg=0x0)
    at ../../../../../slurm/src/slurmd/slurmstepd/req.c:384
#7  0x00007ff6a3875e25 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00007ff6a292534d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7ff6a4088880 (LWP 12616)):
#0  0x00007ff6a3876f57 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x000000000041084a in _wait_for_io (job=0xf07760)
    at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:2219
#2  job_manager (job=job@entry=0xf07760)
    at ../../../../../slurm/src/slurmd/slurmstepd/mgr.c:1397
#3  0x000000000040ca07 in main (argc=1, argv=0x7fffacab93d8)
    at ../../../../../slurm/src/slurmd/slurmstepd/slurmstepd.c:172

Solution

  • The presence of start_thread() in the backtrace indicates that this is a pthread_create() thread.

    __libc_thread_freeres() is a function that glibc calls at thread exit, which invokes a set of callbacks to free internal per-thread state. This indicates the thread you have highlighted is in the process of exiting.

    arena_thread_freeres() is one of those callbacks. It is for the malloc arena allocator, and it moves the free list from the exiting thread's private arena to the global free list. To do this, it must take a lock that protects the global free list (this is the list_lock in arena.c).

    It appears to be this lock that the highlighted thread (Thread 6) is blocked on.

    The arena allocator installs pthread_atfork() handlers which lock the list lock at the start of fork() processing, and unlock it at the end. This means that while other pthread_atfork() handlers are running, all other threads will block on this lock.

    Are you installing your own pthread_atfork() handlers? It seems likely that one of these may be causing your deadlock.