Tags: c++, c++11, thread-safety, static-initialization

Cost of thread-safe local static variable initialization in C++11?


We know that local static variable initialization is thread-safe in C++11, and modern compilers fully support this. (Is local static variable initialization thread-safe in C++11?)

What is the cost of making it thread-safe? I understand that this could very well depend on the compiler implementation.

Context: I have a multi-threaded application (10 threads) that accesses a singleton object pool instance at a very high rate via the following function, and I'm concerned about the performance implications.

template <class T>
ObjectPool<T>* ObjectPool<T>::GetInst()
{
    // C++11 guarantees this initialization happens exactly once,
    // even when multiple threads call GetInst() concurrently.
    static ObjectPool<T> instance;
    return &instance;
}
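
I can partially sidestep the per-call cost by hoisting the call out of hot loops, as in the sketch below (Message, ProcessOne, and ProcessBatch are placeholders for illustration, not my real code), but I'd still like to know what each call costs.

// Hoisting GetInst() out of the loop: the guard variable is then
// checked once per batch instead of once per pooled object.
void ProcessBatch(std::size_t n)
{
    ObjectPool<Message>* pool = ObjectPool<Message>::GetInst();
    for (std::size_t i = 0; i < n; ++i)
        ProcessOne(pool);    // hypothetical per-object work
}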

Solution

  • A look at the generated assembler code helps.

    Source

    #include <vector>
    
    std::vector<int> &get(){
      static std::vector<int> v;
      return v;
    }
    int main(){
      return get().size();
    }
    

    Assembler

    std::vector<int, std::allocator<int> >::~vector():
            movq    (%rdi), %rdi
            testq   %rdi, %rdi
            je      .L1
            jmp     operator delete(void*)
    .L1:
            rep ret
    get():
            movzbl  guard variable for get()::v(%rip), %eax
            testb   %al, %al
            je      .L15
            movl    get()::v, %eax
            ret
    .L15:
            subq    $8, %rsp
            movl    guard variable for get()::v, %edi
            call    __cxa_guard_acquire
            testl   %eax, %eax
            je      .L6
            movl    guard variable for get()::v, %edi
            movq    $0, get()::v(%rip)
            movq    $0, get()::v+8(%rip)
            movq    $0, get()::v+16(%rip)
            call    __cxa_guard_release
            movl    $__dso_handle, %edx
            movl    get()::v, %esi
            movl    std::vector<int, std::allocator<int> >::~vector(), %edi
            call    __cxa_atexit
    .L6:
            movl    get()::v, %eax
            addq    $8, %rsp
            ret
    main:
            subq    $8, %rsp
            call    get()
            movq    8(%rax), %rdx
            subq    (%rax), %rdx
            addq    $8, %rsp
            movq    %rdx, %rax
            sarq    $2, %rax
            ret
    

    Compared to

    Source

    #include <vector>
    
    static std::vector<int> v;
    std::vector<int> &get(){
      return v;
    }
    int main(){
      return get().size();
    }
    

    Assembler

    std::vector<int, std::allocator<int> >::~vector():
            movq    (%rdi), %rdi
            testq   %rdi, %rdi
            je      .L1
            jmp     operator delete(void*)
    .L1:
            rep ret
    get():
            movl    v, %eax
            ret
    main:
            movq    v+8(%rip), %rax
            subq    v(%rip), %rax
            sarq    $2, %rax
            ret
            movl    $__dso_handle, %edx
            movl    v, %esi
            movl    std::vector<int, std::allocator<int> >::~vector(), %edi
            movq    $0, v(%rip)
            movq    $0, v+8(%rip)
            movq    $0, v+16(%rip)
            jmp     __cxa_atexit
    

    I'm not that great with assembler, but the difference is clear: in the first version every call to get() loads and tests the guard variable for v, and the initialization path goes through __cxa_guard_acquire/__cxa_guard_release, so get() is not inlined. In the second version get() is essentially gone; main reads v directly.
    You can play around with various compilers and optimization flags, but it seems no compiler is able to inline get() or optimize the guard away, even though the program is obviously single-threaded.
    You can add static to get(), which makes GCC inline it while still preserving the guard check.
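
    In C++ terms, what the compiler emits behaves roughly like the following double-checked pattern (a sketch of the Itanium C++ ABI scheme, not the literal implementation; the real guard also handles exceptions and recursive initialization):

    #include <atomic>
    #include <mutex>
    #include <new>
    #include <vector>
    
    namespace {
      std::atomic<unsigned char> guard{0};  // "guard variable for get()::v"
      std::mutex init_mutex;                // stands in for __cxa_guard_acquire/release
      alignas(std::vector<int>) unsigned char storage[sizeof(std::vector<int>)];
    }
    
    std::vector<int> &get(){
      // Fast path: a single byte load per call -- this is the recurring cost.
      if (guard.load(std::memory_order_acquire) == 0) {
        // Slow path: only the very first caller(s) ever get here.
        std::lock_guard<std::mutex> lock(init_mutex);
        if (guard.load(std::memory_order_relaxed) == 0) {
          ::new (storage) std::vector<int>();         // run the constructor
          guard.store(1, std::memory_order_release);  // publish the object
          // the real code also registers ~vector() via __cxa_atexit here
        }
      }
      return *reinterpret_cast<std::vector<int> *>(storage);
    }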

    To know how much the guard and the additional instructions cost for your compiler, flags, platform, and surrounding code, you would need to write a proper benchmark.
    I would expect the guarded version to have some overhead and be noticeably slower than the inlined code, though the difference should become insignificant once you do real work with the vector; you can never be sure without measuring.
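
    For example, a minimal benchmark sketch along those lines (my own illustration, not a rigorous measurement; a serious benchmark should use something like Google Benchmark to keep the optimizer from removing the calls):

    #include <chrono>
    #include <cstdio>
    #include <vector>
    
    std::vector<int> &get_local_static(){
      static std::vector<int> v;  // guarded initialization, checked on every call
      return v;
    }
    
    static std::vector<int> g;
    std::vector<int> &get_global(){ return g; }  // no guard check
    
    template <class F>
    long long time_ns(F f){
      auto t0 = std::chrono::steady_clock::now();
      std::size_t sink = 0;
      for (long i = 0; i < 100000000; ++i)
        sink += f().size();
      auto t1 = std::chrono::steady_clock::now();
      std::printf("sink=%zu\n", sink);  // keep the loop from being optimized out
      return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }
    
    int main(){
      std::printf("local static: %lld ns\n", time_ns(get_local_static));
      std::printf("plain global: %lld ns\n", time_ns(get_global));
    }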