I've got a custom implementation of detours on macOS and a test application using it, which is written in C, compiled for macOS x86_64, running on an Intel i9 processor.
The implementation works fine with a multitude of functions. However, if I detour pthread_create, I encounter strange behaviour: threads that have been spawned via a detoured pthread_create do not execute instructions. I can step through instructions one by one, but as soon as I continue, the thread does not progress. There are no mutexes or synchronisation involved and the result of the function is 0 (success). The exact same application with detours turned off works fine, so the application itself is unlikely to be the culprit.
This does not happen all the time: sometimes the threads are fine, but at other times the test application stalls in the following state:
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff7296f55e libsystem_kernel.dylib`__ulock_wait + 10
frame #1: 0x00007fff72a325c2 libsystem_pthread.dylib`_pthread_join + 347
frame #2: 0x0000000100001186 DetoursTestApp`main + 262
frame #3: 0x00007fff7282ccc9 libdyld.dylib`start + 1
frame #4: 0x00007fff7282ccc9 libdyld.dylib`start + 1
thread #2
frame #0: 0x00007fff72a2cb7c libsystem_pthread.dylib`thread_start
Relevant memory pages have the executable flag set. The detour function that intercepts the thread creation looks like this:
static int pthread_create_detour(pthread_t* thread,
const pthread_attr_t* attr,
void* (*start_routine)(void*),
void* arg)
{
detour_count++;
pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
return original(thread, attr, start_routine, arg);
}
Where detour_original retrieves the pointer to [original function + size of function's prologue].
Tracing through the instructions, everything seems to be working correctly and pthread_create returns successfully. Tracing the application's system calls via dtruss does show calls to
bsdthread_create(0x10DB964B0, 0x0, 0x7000080DB000) = 29646848 0
with what I have confirmed are the correct arguments.
This behaviour is only observed in release builds. Debug builds work fine, but the disassembly and execution of a detoured pthread_create and the associated detours code seem to be identical in both cases.
I found a couple of odd workarounds for this issue that don't make much sense. Given the detour function, a number of things can be substituted into the following:
static int pthread_create_detour(pthread_t* thread,
const pthread_attr_t* attr,
void* (*start_routine)(void*),
void* arg)
{
detour_count++;
pthread_fn original = (pthread_fn)detour_original(dlsym((void*)-1, "pthread_create"));
<...> <== SUBSTITUTE HERE
return original(thread, attr, start_routine, arg);
}
__asm__ __volatile__("" ::: "memory");
_mm_clflush(real_pthread_create);
usleep(1)
A printf statement.
void *data = malloc(40000);

All of these seem to point to a stale instruction cache. However, the Intel manual states the following:
A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.
What's even more interesting is that those workarounds have to be executed for every new thread created, with the execution happening on the main thread, so it's very unlikely to be the cache. I have also tried putting in cache flushes at every memory write that writes instructions, but that did not help. I've also written a memcpy that bypasses the cache using Intel's _mm_stream_si32 intrinsic and swapped it in for every instruction memory write in my implementation, without any success.
The next suspect in line is a race condition. However, it's not clear what would be racing, as at first there are no other threads. I have put in a Fibonacci sequence calculation for a randomly-generated number at the substitution point, and the newly-spawned threads would still stall.
What is causing this issue? What other mechanisms could be responsible for this?
At this point I have run out of things to check so any suggestions will be welcome.
I found that the reason why the spawned thread was not executing instructions was that the r8 register wasn't being cleared at the right time in the execution of pthread_create due to an issue with my detours implementation.
If we look at the disassembly of the function, it is split into two parts: the "head" and the "body", which is found in an internal _pthread_create function. The head does two things: it zeroes out r8 and jumps to the body:
libsystem_pthread.dylib`pthread_create:
0x7fff72a2e236 <+0>: 45 31 c0 xor r8d, r8d
0x7fff72a2e239 <+3>: e9 40 37 00 00 jmp 0x7fff72a3197e ; _pthread_create
libsystem_pthread.dylib`_pthread_create:
0x7fff72a3197e <+0>: 55 push rbp
0x7fff72a3197f <+1>: 48 89 e5 mov rbp, rsp
0x7fff72a31982 <+4>: 41 57 push r15
<...> // the rest of the 1409 instructions
My implementation would detour the internal _pthread_create function instead of the head containing the actual entry point, which meant that r8 would get cleared at the wrong time (before the detour). Since the detour function would contain some code, the execution would go something like:
pthread_create (r8 gets cleared) -> _pthread_create -> chain of jumps -> pthread_create_detour -> trampoline (containing the beginning of _pthread_create) -> _pthread_create + 6
This meant that, depending on the contents of the pthread_create_detour function, r8 would not always contain 0 when execution returned to the internal function.
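To illustrate the kind of check that avoids this, here is a minimal sketch, not my actual detours code. It assumes the mistake came from a resolver that follows a jmp found near the entry point; the helper names jmp_rel32_target and resolve_patch_target are made up for this example. The idea is to follow a jmp rel32 only when it is the very first instruction, so that a head like the one above (xor r8d, r8d followed by jmp) gets patched at its real entry point and the xor still runs after the detour.

```c
#include <stdint.h>

/* Decode the destination of a 5-byte `jmp rel32` (opcode E9) located at
 * insn_addr. The displacement is little-endian and relative to the end
 * of the instruction. */
static uint64_t jmp_rel32_target(uint64_t insn_addr, const uint8_t *insn)
{
    uint32_t raw = (uint32_t)insn[1]
                 | ((uint32_t)insn[2] << 8)
                 | ((uint32_t)insn[3] << 16)
                 | ((uint32_t)insn[4] << 24);
    return insn_addr + 5 + (uint64_t)(int64_t)(int32_t)raw;
}

/* Hypothetical policy: follow the jump only for a pure stub whose first
 * byte is E9. If anything executes before the jump (for example
 * `xor r8d, r8d`), patch the entry point itself so that instruction is
 * not skipped or reordered relative to the detour. */
static uint64_t resolve_patch_target(uint64_t entry_addr, const uint8_t *code)
{
    if (code[0] == 0xE9)
        return jmp_rel32_target(entry_addr, code);
    return entry_addr;
}
```

Fed the eight bytes of the pthread_create head shown above, resolve_patch_target returns the address of pthread_create itself rather than _pthread_create. A real implementation would need a proper instruction decoder; this only handles the single E9 case.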
It's not yet clear why having r8 set to something other than 0 before _pthread_create would not crash but instead start up a thread in a locked-up state. An important detail is that the stalled thread would have the rflags register set to 0x200, which should never be the case according to Intel's manual (bit 1 of the flags register is reserved and always reads as 1). This is what led me to inspect the CPU state more closely, leading to the answer.
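The rflags observation can be turned into a small sanity check. This is only a sketch: the macro and function names are my own, and it relies on nothing more than the fact that bit 1 of RFLAGS is reserved and always set, while bit 9 is the interrupt flag (IF).

```c
#include <stdbool.h>
#include <stdint.h>

#define RFLAGS_RESERVED_BIT1 (UINT64_C(1) << 1) /* reserved, always 1 */
#define RFLAGS_IF            (UINT64_C(1) << 9) /* interrupt enable flag */

/* Returns true if the value could be a real RFLAGS snapshot, judged only
 * by the always-set reserved bit. */
static bool rflags_reserved_bit_ok(uint64_t rflags)
{
    return (rflags & RFLAGS_RESERVED_BIT1) != 0;
}
```

A value of 0x200 has only IF set and fails this check, whereas 0x202 passes. Seeing 0x200 in the debugger is consistent with a thread whose register state was set up but which never actually started running.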