Tags: c++, optimization, x86, false-sharing, stdmutex

Why is sizeof(std::mutex) == 40 when the cache line size is often 64 bytes?


The following static_assert passes with both GCC and Clang trunk.

#include <mutex>

int main() {
    // The size is implementation-specific; 40 is what GCC and Clang trunk report here.
    static_assert(sizeof(std::mutex) == 40);
}

Since x86 CPUs have 64-byte cache lines, I was expecting sizeof(std::mutex) to be 64 so that false sharing can be avoided. Is there a reason why the size is "only" 40 bytes?

Note: I know size also costs performance, but a program rarely has a huge number of mutexes, so the size overhead seems negligible compared to the cost of false sharing.

Note: there is a similar question asking why std::mutex is so large; I am asking why it is so small :)

Edit: MSVC 16.7 has sizeof(std::mutex) == 80.


Solution

  • Forcing padding where it's not needed would be bad design. Users can always pad if they have nothing useful to put in the rest of the cache line.

    If the mutex is usually lightly contended, you probably want it in the same cache line as the data it's protecting: there is only one cache line to bounce around, instead of a second cache miss when accessing the shared data after acquiring the lock. This is probably common with fine-grained locking, where many objects each have their own std::mutex, and it makes keeping the mutex small more beneficial.
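
    A minimal sketch of that layout, assuming 64-byte cache lines and a std::mutex small enough to leave room in the line (such as the 40-byte one from the question; MSVC's 80-byte one would not fit). GuardedCounter and its members are hypothetical names used only for illustration:

    ```cpp
    #include <cassert>   // used by the usage check below
    #include <cstddef>
    #include <mutex>

    // Hypothetical example type: a counter guarded by its own mutex.
    // alignas(64) assumes 64-byte cache lines; it makes every object start
    // on a line boundary, so the lock and the value it protects share one
    // line whenever sizeof(std::mutex) + sizeof(long) <= 64.
    struct alignas(64) GuardedCounter {
        std::mutex lock;
        long value = 0;

        void add(long delta) {
            std::lock_guard<std::mutex> guard(lock);
            value += delta;  // same line as the lock we just acquired
        }

        long get() {
            std::lock_guard<std::mutex> guard(lock);
            return value;
        }
    };

    // Holds trivially when the mutex is too large to share a line with the
    // value; otherwise checks that the whole object is exactly one line.
    static_assert(sizeof(std::mutex) + sizeof(long) > 64 ||
                      sizeof(GuardedCounter) == 64,
                  "lock and data should fit in one 64-byte line");
    ```

    Used as `GuardedCounter c; c.add(2); c.add(3); assert(c.get() == 5);`, acquiring the lock already brings the counter's line into the core's cache, so the update itself should not need a second line.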

    (Under heavy contention, sharing a line could create false sharing between threads trying to acquire the lock and the lock owner writing to the shared data after gaining ownership. Flipping the cache line to "shared", or invalidating it, before the lock owner has a chance to write would indeed slow things down.)


    Or the space in the rest of the line could be some very-rarely-used thing that needs to exist somewhere in the program, but maybe only used for error handling so its performance doesn't matter. If it couldn't share a line with a mutex, it would have to be taking up space somewhere else. (Maybe in some page of "cold" data, so this isn't a great example).

    It's probably unlikely that you'd want to malloc or new a mutex by itself, although one could be part of a class you dynamically allocate. Allocator overhead is a real thing, e.g. using 16 bytes of memory before the allocation for bookkeeping. (Large allocations with glibc's malloc/new are often page-aligned + 16 bytes, making them misaligned with respect to all wider boundaries.) Dynamic-allocator bookkeeping is a very good thing for a mutex to share space with: it's probably not read or written by anything while the mutex is in use.


    Non-lock-free std::atomic objects typically use an array of locks (maybe just simple spinlocks, but they could be std::mutex). If they are std::mutex objects, you don't expect two adjacent mutexes to be in use simultaneously, so it's good to pack them all together.


    Also, increasing its size would be a very clunky way to try to ensure no false sharing. An implementation that wanted to make sure a std::mutex had a cache line to itself would want to declare it with alignas(64), so that its alignof() is 64. That would force padding to make sizeof(mutex) a multiple of alignof (in this case equal to it).

    But note that if you're going to fix a size for it, std::hardware_destructive_interference_size should be 128 on some modern x86-64 CPUs, because of adjacent-line hardware prefetch in Intel's L2 caches. That's a weaker destructive effect than sharing the same cache line, and 128 bytes is too much space to waste.
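
    For code that does want a mutex on a line of its own, the padding can be added on the user side. The following is only a sketch: PaddedMutex, kLineSize and on_distinct_lines are hypothetical names, and the fallback of 64 bytes is an assumption for standard libraries that don't provide std::hardware_destructive_interference_size.

    ```cpp
    #include <cassert>   // used by the usage check below
    #include <cstddef>
    #include <cstdint>
    #include <mutex>
    #include <new>       // std::hardware_destructive_interference_size, where provided

    // Assumption: fall back to 64 bytes when the library does not define
    // the C++17 constant (the feature-test macro comes from <new>).
    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t kLineSize = std::hardware_destructive_interference_size;
    #else
    constexpr std::size_t kLineSize = 64;
    #endif

    // Hypothetical wrapper: alignas raises alignof to kLineSize, and the
    // compiler then pads sizeof up to a multiple of that alignment.
    struct alignas(kLineSize) PaddedMutex {
        std::mutex m;
    };

    static_assert(alignof(PaddedMutex) == kLineSize, "alignment not applied");
    static_assert(sizeof(PaddedMutex) % kLineSize == 0,
                  "size must be a multiple of the alignment");

    // Hypothetical helper: true if the two objects start in different
    // kLineSize-sized blocks of the address space.
    inline bool on_distinct_lines(const PaddedMutex& a, const PaddedMutex& b) {
        const auto pa = reinterpret_cast<std::uintptr_t>(&a);
        const auto pb = reinterpret_cast<std::uintptr_t>(&b);
        return pa / kLineSize != pb / kLineSize;
    }
    ```

    With `PaddedMutex locks[2];`, `assert(on_distinct_lines(locks[0], locks[1]));` should hold, since adjacent array elements are at least kLineSize bytes apart. Plain std::mutex stays compact for everyone who doesn't need this.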