
gdb - Can you find the thread holding an internal glibc lock?


I think that my question is not a duplicate of this question, which describes how to find which thread owns a pthread_mutex_t. I want to know how to find the owner of an internal glibc lock, which I don't think is the same thing as a pthread_mutex_t. Consider the following thread from a backtrace taken from my application, which is hanging (deadlocked, maybe?):

Thread 1 (Thread 0x7f8478e1b700 (LWP 24662)):
#0  0x00007f847cf277fc in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f847cea350c in _L_lock_5314 () from /lib64/libc.so.6
#2  0x00007f847ce9c108 in _int_free () from /lib64/libc.so.6
... <unnecessary details which lead up to this thread calling free> ...
#11 0x00007f847db18ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f847cf19b0d in clone () from /lib64/libc.so.6

Can I use any tricks similar to the linked question or this other webpage to figure out what thread is holding the internal glibc lock?

Additional details:

Note: This backtrace was generated from a core file from a client, so I can't simply test gdb commands myself, otherwise I would have tried playing around with it some more.

There are a bunch of other threads in the program; I won't copy them all, but I think that this thread, which is calling fork(), is holding the lock:

Thread 4 (Thread 0x7f8479724700 (LWP 24775)):
#0  0x00007f847cee0b12 in fork () from /lib64/libc.so.6
...
#6  0x00007f847db18ea5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f847cf19b0d in clone () from /lib64/libc.so.6

I suspect that fork() probably needs the same lock as free(), but I'd like to determine that for sure and learn whatever else I can about glibc's internal state.

The "XY" part of this question (what I really want to know) is why thread 4 is seemingly not progressing. I don't know, and I'm working with the client to learn more, but they said the program was stuck here for 3 hours before they noticed and killed it. I wanted to focus this question on debugging internal glibc stuff with gdb.

Note: I am aware of the dangers of using fork() in a multi-threaded process: the child must only call async-signal-safe functions before calling exec(). My issue is not concerning the child anyway, it is concerning the parent.


Solution

  • Unfortunately for you, you're right, internal glibc locks are not pthread_mutex_t. They are simply futexes, i.e. just integers (pthread mutexes are also implemented with futexes under the hood, but wrapped in a structure that holds more information). This doc-comment at the top of the internal lowlevellock.h in glibc explains how they are implemented in more detail. Since we are dealing with raw futexes, you cannot use the trick you mention to know the owner.

    Depending on the kind of futex used, it might be more or less simple to debug your current situation. There are two categories:

    • Normal futexes. These are implemented as plain integers and do not track their owner themselves (other bookkeeping around them may, but the futex word itself holds no owner information).
    • Robust futexes (see this introduction to robust futexes and the robust futex ABI documentation). These do track owners, as a mechanism to recover from exceptional situations such as crashes while a futex is being held. The futex word holds the owner thread ID, optionally ORed with some flags (see the previous doc links). Their management is a bit more complex than that of a normal futex, and they also require a couple more syscalls ({set,get}_robust_list) to work properly.

    If you are dealing with a robust futex, you can inspect the futex word (which is easy to do in GDB) and recover the TID of the owner.

    Unfortunately (x2), it seems to me that the lock you are dealing with is not a robust futex, but a normal one. In particular, __lll_lock_wait_private() does not seem to deal with TIDs; it compares/stores small constants (such as 1 or 2) in the futex word (one would have to check the exact glibc version you are dealing with, which you can do by changing the version in the previous link, but still).

    If I am correct and this is the case, I don't see many options; the few I can think of are:

    • If you cannot run tests yourself and only have access to core dumps, distribute a version of your application compiled with debug symbols and add some logging to collect along with the core dumps. If that is not possible, maybe add some internal logic that writes per-thread information to a known area of memory so that it ends up included in the core dumps.
    • If you can run the application and replicate the problem locally, try using strace -f -e futex to detect futex syscalls and reason about the output. Inserting debugging print statements in each thread will help you figure out which TID is which, and looking at the flags passed to futex will show who is waiting on which futex. In the case of a deadlock, you may see a thread stuck in a futex syscall forever.
    • Or try writing an eBPF filter to trace futex syscalls: see e.g. this tutorial. Not much different from the previous case, but it would allow you to debug the program while also tracing it through eBPF, which you cannot do with strace (a process can only be traced by one tracer at a time). You could potentially also do this in a production environment, since eBPF tracepoints are effectively separate and don't interfere with tracees.

    Ultimately, though, what I would recommend is to start by adding as many debug prints as possible while ensuring that the issue is still reproducible, and then narrow things down from there.