c++c++11 clang compiler-optimization thread-local-storage

thread_local strange inlining

So for the following code, I see a strange inlining/optimisation artefact.

I'm curious to know if they are "necessary" in some hypothetical scenario I'm not appreciating?

Godbolt: https://godbolt.org/z/M8PW1obE7

#include <cstdint>
#include <stdio.h>
struct ThreadStateLogger
{
    static thread_local struct Instance
    {
        char TLS_byteLoc[3]    {' ', 0, 0};
        uint8_t TLS_numBytes    {1};
        // 4 wasted bytes here...
        char* TLS_byteLocPtr {TLS_byteLoc};

        void Log(char v)
        {
            TLS_byteLocPtr[0] = v;
        }

        void Log(char v1, char v2)
        {
            TLS_byteLocPtr[0] = v1;
            if (TLS_numBytes>1)
                TLS_byteLocPtr[1] = v2;
        }
    } instance;

    static void Log(char v1, char v2)
    {
        instance.Log(v1, v2);
    }

    // static void Log(char v1, char v2)
    // {
    //     instance.TLS_byteLocPtr[0] = v1;
    //     if (instance.TLS_numBytes>1)
    //         instance.TLS_byteLocPtr[1] = v2;
    // }

};
extern ThreadStateLogger theThreadStateLogger;
int main()
{
    ThreadStateLogger::Log('a', 'b');
    // printf("Hello world");
    ThreadStateLogger::Log('c', 'd');
    return 0;
}

The whole primary implementation gets inlined with -O3, which is what I want :-)

So it appears that the first Log() call correctly checks if this TLS needs allocating, and then gets the converted address using __tls_get_addr@PLT, which is all good.

The second Log() call also apparently checks if the object needs initialising, but then uses the cached addresses from the first call (rbx)! So if it did initialise, that could be wrong?

Below is the output from clang16 on godbolt, which is comparable to gcc - the same reinitialization test and cached address, but it does somewhat better than my current clang10 with -fPIC. https://godbolt.org/z/M8PW1obE7

main:                                   # @main
        push    rbx
        cmp     qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        je      .LBB0_2
        call    TLS init function for ThreadStateLogger::instance@PLT
.LBB0_2:
        data16
        lea     rdi, [rip + ThreadStateLogger::instance@TLSGD]
        data16
        data16
        rex64
        call    __tls_get_addr@PLT
        mov     rbx, rax
        mov     rax, qword ptr [rax + 8]
        mov     byte ptr [rax], 97
        cmp     byte ptr [rbx + 3], 2
        jae     .LBB0_3
        cmp     qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        jne     .LBB0_5
.LBB0_6:
        mov     rax, qword ptr [rbx + 8]
        mov     byte ptr [rax], 99
        cmp     byte ptr [rbx + 3], 2
        jae     .LBB0_7
.LBB0_8:
        xor     eax, eax
        pop     rbx
        ret
.LBB0_3:
        mov     rax, qword ptr [rbx + 8]
        mov     byte ptr [rax + 1], 98
        cmp     qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        je      .LBB0_6
.LBB0_5:
        call    TLS init function for ThreadStateLogger::instance@PLT
        mov     rax, qword ptr [rbx + 8]
        mov     byte ptr [rax], 99
        cmp     byte ptr [rbx + 3], 2
        jb      .LBB0_8
.LBB0_7:
        mov     rax, qword ptr [rbx + 8]
        mov     byte ptr [rax + 1], 100
        xor     eax, eax
        pop     rbx
        ret

edits

Removed a printf - ?MORE? init checks have been added, but __tls_get_addr is still cached in rbx.

Hmm It looks like an attempt to optimise the compares with TLS_numBytes in at least 1 code path. I guess that is because I have no complexity between the two calls, so that seems like a valid optimisation but wouldn't be useful in practice.

So the runtime cost is only a repeated check of the TLS-singleton-initialised flag, which we know is set. There is the code bloat as well, which is annoying.

tldr

So reminder of the question: Why is the init check/call repeated? If that is necessary, then why does the address not need to be regenerated? Or is this just an optimisation nobody has thought of doing? Is there any way to get a better optimisation by juggling the code pattern? I have had a few goes to get to here, as you can see.

Of course, I need the functionality of a logging class with this behaviour here, and the use of a class to encapsulate the 3 TLS variables means they are all constructed together instead of 1-at-a-time.

Solution

LLVM and likely GCC are unable to optimize this. Devirtualization, thread-local storage initialization, and other language features can be optimized locally within a function, but once they are inlined into another, there is no further potential.

The problem becomes clear when we look at the LLVM IR that clang outputs for this:

static void Log(char c) {
    // we suffer the same issue with one and two chars, this just makes our IR shorter
    instance.Log(c);
}

After all optimization passes, this function becomes:

define linkonce_odr void @ThreadStateLogger::Log(char)(i8 noundef signext %v1) local_unnamed_addr #1 comdat align 2 {
entry:
  br i1 icmp ne (ptr @TLS init function for ThreadStateLogger::instance, ptr null), label %0, label %TLS wrapper function for ThreadStateLogger::instance.exit

0:                                                ; preds = %entry
  call void @TLS init function for ThreadStateLogger::instance()
  br label %TLS wrapper function for ThreadStateLogger::instance.exit

TLS wrapper function for ThreadStateLogger::instance.exit:          ; preds = %entry, %0
  %1 = call align 8 ptr @llvm.threadlocal.address.p0(ptr align 8 @ThreadStateLogger::instance)
  %TLS_byteLocPtr.i = getelementptr inbounds %"struct.ThreadStateLogger::Instance", ptr %1, i64 0, i32 2
  %2 = load ptr, ptr %TLS_byteLocPtr.i, align 8
  store i8 %v1, ptr %2, align 1
  ret void
}

The important thing is that this function contains:

call void @TLS init function for ThreadStateLogger::instance() (demangled)
call void @_ZTHN17ThreadStateLogger8instanceE() (mangled)

It's not a magic function, and the compiler has no way of knowing that calling it once makes future calls unnecessary. The function call is located in a branch which reads from mutable global memory:

cmp qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0

and there is no IR-level information that this memory must be set to != 0 by our call void @_ZTHN17ThreadStateLogger8instanceE().

If you subsequently inline ThreadStateLogger::Log(char) into main, the two branches which contains this function call don't obsolete each other, and you see two instances of call TLS init function for ThreadStateLogger::instance@PLT

Why does `__tls_get_addr` get optimized then?

__tls_get_addr is @llvm.threadlocal.address.p0, it's likely subject to optimizations that a "random" call to a _ZTHN17ThreadStateLogger8instanceE function wouldn't have.

The call gets eliminated during an EarlyCSEPass in main. See compiler explorer with LLVM Optimization Pipeline.

thread_local strange inlining

Why does __tls_get_addr get optimized then?

Why does `__tls_get_addr` get optimized then?