I think that my question is not a duplicate of this question, which describes how to find which thread owns a pthread_mutex_t. I want to know how to find the owner of an internal glibc lock, which I don't think is the same thing as a pthread_mutex_t. Consider the following thread from a backtrace taken from my application which is hanging (deadlocked, maybe?):
Thread 1 (Thread 0x7f8478e1b700 (LWP 24662)):
#0 0x00007f847cf277fc in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f847cea350c in _L_lock_5314 () from /lib64/libc.so.6
#2 0x00007f847ce9c108 in _int_free () from /lib64/libc.so.6
... <unnecessary details which lead up to this thread calling free> ...
#11 0x00007f847db18ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f847cf19b0d in clone () from /lib64/libc.so.6
Can I use any tricks similar to the linked question or this other webpage to figure out what thread is holding the internal glibc lock?
Additional details:
Note: This backtrace was generated from a core file from a client, so I can't simply test gdb commands myself, otherwise I would have tried playing around with it some more.
There are a bunch of other threads in the program; I won't copy them all, but I think that this thread, which is calling fork(), is holding the lock:
Thread 4 (Thread 0x7f8479724700 (LWP 24775)):
#0 0x00007f847cee0b12 in fork () from /lib64/libc.so.6
...
#6 0x00007f847db18ea5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f847cf19b0d in clone () from /lib64/libc.so.6
I suspect that fork() probably needs the same lock as free(), but I'd like to determine that for sure and learn whatever else I can about glibc's internal state.
The "XY" part of this question (what I really want to know) is why thread 4 is seemingly not progressing. I don't know, and I'm working with the client to learn more, but they said the program was stuck here for 3 hours before they noticed and killed it. I wanted to focus this question on debugging internal glibc stuff with gdb.
Note: I am aware of the dangers of using fork() in a multi-threaded process: the child must only call async-signal-safe functions before calling exec(). My issue does not concern the child anyway; it concerns the parent.
Unfortunately for you, you're right: internal glibc locks are not pthread_mutex_t. They are simply futexes, i.e. just integers (pthread mutexes are also implemented with futexes under the hood, but wrapped in a structure that holds more information). The doc-comment at the top of the internal lowlevellock.h in glibc explains how they are implemented in more detail. Since we are dealing with raw futexes, you cannot use the trick you mention to know the owner.
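To make the contrast concrete, here is a minimal GDB sketch (the names some_mutex and $lock_addr are hypothetical, not from your core file): for a pthread_mutex_t the owner is recorded in the structure itself, whereas an internal glibc lock is a bare integer with nothing to recover:

(gdb) print some_mutex.__data.__owner     # pthread_mutex_t: LWP of the owning thread is stored in the struct
(gdb) print *(int *) $lock_addr           # internal glibc lock: just an int holding a state value, no owner field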
Depending on the kind of futex used, it might be more or less simple to debug your current situation. There are two categories:

- Normal futexes, where the futex word only holds a small state value and carries no information about which thread owns the lock.
- Robust futexes, which need cooperation between userspace and the kernel (registered through the {set,get}_robust_list syscalls) to work properly, and whose futex word holds the TID of the owning thread.

In case you are dealing with robust futexes, you can inspect the futex word (easy to do in GDB) and recover the TID of the owner.
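For illustration, if it were a robust futex, a minimal GDB sketch could look like this (assuming $futex_addr is a hypothetical convenience variable holding the address of the futex word; the masks are the constants from linux/futex.h):

(gdb) set $w = *(unsigned int *) $futex_addr
(gdb) print $w & 0x3fffffff               # FUTEX_TID_MASK: TID (LWP) of the owning thread
(gdb) print ($w & 0x80000000) != 0        # FUTEX_WAITERS: other threads are blocked on this futex
(gdb) print ($w & 0x40000000) != 0        # FUTEX_OWNER_DIED: the owner died without unlocking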
Unfortunately (x2), it seems to me that the lock you are dealing with is not a robust futex, but a normal one. In particular, __lll_lock_wait_private() does not seem to deal with TIDs; it compares and stores small constants (such as 1 or 2) in the futex word (one would have to check the exact glibc version you are dealing with, which you can do by changing the version in the previous link, but still).
If I am correct and this is the case, I don't really see many options; the only few I can think of are:

- Run the program under strace -f -e futex to detect futex syscalls and reason about the output. Inserting debugging print statements in each thread will help you understand which TID is which, and looking at the flags passed to futex will help you understand who is waiting on which futex. In the case of a deadlock, you may see a thread stuck in a futex syscall forever (see the sketch at the end of this answer).
- Attach eBPF tracepoints to futex syscalls: see e.g. this tutorial. Not really much different from the previous case, but this would allow you to debug the program while also tracing it through eBPF, which you cannot do if you are using strace (a process can only be traced by one tracer at a time). You could potentially also do this in a production environment, since eBPF tracepoints are effectively separate and don't interfere with tracees.

Ultimately, what I would recommend is to start by adding as many debug prints as possible while ensuring that the issue is still reproducible, and then narrow things down from there.
(a process can only be traced by one tracer at a time). You could potentially also do this in a production environment since eBPF tracepoints are effectively separate and don't interfere with tracees.Ultimately what I would recommend though is start by adding as many debug prints as possible while ensuring that the issue is still reproducible, and then start narrowing things down from there.