Tags: linux, x86-64, shared-memory, mmap, memory-barriers

Is the memory returned from mmapping /dev/shm Write-Back (WB) or Non-Cacheable Write-Combining (WC) on Linux/x86?


I have two C++ processes that communicate via a memory-mapped Single-Producer Single-Consumer (SPSC) double buffer. The processes will only ever run on Linux/Intel x86-64. The semantics are that the producer fills the front buffer and then swaps pointers and updates a counter, letting the consumer know that it can memcpy() the back buffer. All shared state is stored in a header block at the start of the mmapped region.

int _fd;
volatile char *_mappedBuffer;

...

// shm_open() takes a name of the form "/name"; the object then appears under /dev/shm.
_fd = shm_open("/ipc_buffer", O_CREAT | O_TRUNC | O_RDWR, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH);
...
_mappedBuffer = static_cast<char *>(mmap(nullptr, _shmFileSizeBytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE | MAP_POPULATE, _fd, 0));

The producer needs a StoreStore barrier so that the pointer swap becomes visible before the counter increment. On x86 with Write-Back (WB) memory, that ordering should be implicit:

void produce() {
    ...

    // swap pointers
    char *tmp = _frontBuffer;
    _frontBuffer = _backBuffer;
    _backBuffer = tmp;

    ...

    // SFENCE needed here? Yes if uncacheable WC, NO if WB due to x86 ordering guarantees?
    asm volatile ("sfence" ::: "memory");

    _flipCounter++;
}

If the memory is WC, the consumer needs a LoadLoad barrier to ensure it loads the flip counter before the new back buffer pointer. If the memory is WB, the CPU can't reorder the two loads:

bool consume(uint64_t &localFlipVer, char *dst) {
    if (localFlipVer < _flipCounter) {
        // LFENCE needed here? Yes if uncacheable WC, NO if WB due to x86 ordering guarantees?
        asm volatile ("lfence" ::: "memory");

        std::memcpy(dst, _backBuffer, _bufferSize);
        localFlipVer++;
        return true;
    }

    return false;
}

My question and my assumptions:

Is the memory-mapped region returned by mmapping /dev/shm Write-Back or Non-cacheable Write-Combining? If the latter, the stores and loads are weakly ordered and don't follow the traditional x86 ordering guarantees (No StoreStore or LoadLoad re-orderings) according to

https://hadibrais.wordpress.com/2019/02/26/the-significance-of-the-x86-sfence-instruction/

https://preshing.com/20120913/acquire-and-release-semantics/#IDComment721195741

https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/596002

and therefore, I'd have to use SFENCE and LFENCE, whereas normally (with WB), I could get away with just a compiler barrier asm volatile ("" ::: "memory");


Solution

  • /dev/shm/ is just a tmpfs mount point, like /tmp.
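One way to confirm the tmpfs claim on a given machine is to ask the kernel for the filesystem type. This is a minimal sketch, assuming a Linux system with the kernel headers installed; `is_tmpfs` is a hypothetical helper name, not something from the question.

```cpp
#include <cassert>
#include <linux/magic.h>  // TMPFS_MAGIC
#include <sys/vfs.h>      // statfs()

// Hypothetical helper: returns true if `path` lives on a tmpfs mount.
// Returns false if the path cannot be queried (e.g. it does not exist).
bool is_tmpfs(const char *path) {
    assert(path != nullptr);
    struct statfs fs;
    if (statfs(path, &fs) != 0) {
        return false;
    }
    return fs.f_type == TMPFS_MAGIC;
}
```

Calling `is_tmpfs("/dev/shm")` should return true on a typical Linux distribution, although the result depends on how the system is mounted.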

    Memory you mmap from files there is normal WB cacheable memory, just like MAP_ANONYMOUS. It follows the normal x86 memory-ordering rules (program order plus a store buffer with store forwarding), so you don't need SFENCE or LFENCE. For acquire/release ordering you only need to block compile-time reordering. For seq_cst you need MFENCE or a locked operation, such as using xchg to do the store.
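As a rough illustration of the point above, the flip counter can be published with a release store and read with an acquire load, which on x86 compile to plain loads and stores with no fence instructions. This is a sketch under stated assumptions: the `Header` layout, the buffer index in place of the pointer swap, and the `demo` function are all hypothetical, and a `MAP_SHARED | MAP_ANONYMOUS` mapping shared across `fork()` stands in for the /dev/shm file.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <new>
#include <string>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical header at the start of the shared mapping.
struct Header {
    std::atomic<uint64_t> flipCounter{0};
    uint32_t backIndex = 0;       // index of the buffer the consumer may read
    char buffers[2][64] = {};
};

// Producer: fill the front buffer, swap, then publish with a release store.
void produce(Header &h, const char *msg) {
    uint32_t front = 1 - h.backIndex;
    std::strncpy(h.buffers[front], msg, sizeof h.buffers[front] - 1);
    h.backIndex = front;  // stands in for the pointer swap
    uint64_t next = h.flipCounter.load(std::memory_order_relaxed) + 1;
    h.flipCounter.store(next, std::memory_order_release);
}

// Consumer: the acquire load orders the later reads after the counter read.
bool consume(Header &h, uint64_t &localFlipVer, char *dst) {
    if (localFlipVer < h.flipCounter.load(std::memory_order_acquire)) {
        std::memcpy(dst, h.buffers[h.backIndex], sizeof h.buffers[0]);
        ++localFlipVer;
        return true;
    }
    return false;
}

// Demo: the child process publishes once, the parent spins until it sees it.
std::string demo() {
    void *mem = mmap(nullptr, sizeof(Header), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(mem != MAP_FAILED);
    Header *h = new (mem) Header();
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        produce(*h, "hello");
        _exit(0);
    }
    char out[64];
    uint64_t seen = 0;
    while (!consume(*h, seen, out)) {
    }
    waitpid(pid, nullptr, 0);
    munmap(mem, sizeof(Header));
    return std::string(out);
}
```

The sketch publishes only once. As in the question's design, nothing stops the producer from overwriting a buffer while the consumer is still copying it, so a real implementation would need some form of consumer acknowledgement.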

    You can use C11 <stdatomic.h> functions on pointers into SHM, for types that are lock_free. (Normally any power-of-2 size up to pointer width.)
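The C++ equivalent is `std::atomic` from `<atomic>`. A small sketch of the lock-free check might look like the following; `init_shared_counter` is a hypothetical helper, and `addr` is assumed to point into a suitably aligned `MAP_SHARED` mapping.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <new>

// A value of 2 means 64-bit integer atomics are always lock-free on this target.
static_assert(ATOMIC_LLONG_LOCK_FREE == 2,
              "64-bit atomics must be lock-free to be shared across processes");

// Hypothetical helper: construct an atomic counter in place at `addr`.
// Only the process that creates the mapping should call this; other
// processes would just cast their mapped address to the same type.
std::atomic<uint64_t> *init_shared_counter(void *addr) {
    assert(reinterpret_cast<std::uintptr_t>(addr) %
               alignof(std::atomic<uint64_t>) == 0);
    auto *counter = new (addr) std::atomic<uint64_t>(0);
    assert(counter->is_lock_free());
    return counter;
}
```

The compile-time check fails the build on a target where the type would fall back to locks, which is the case the next paragraph warns about.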

    Non-lock-free objects use a hash table of locks in the address space of the process doing the operation, so separate processes won't respect each other's locks. 16-byte objects may still use lock cmpxchg16b, which is address-free and works across processes, even though GCC 7 and later report them as non-lock-free even if you compile with -mcx16.


    I don't think there is a way on a mainstream Linux kernel for user-space to allocate memory of any type other than WB. (Other than the X server or direct-rendering clients mapping video RAM; I mean no way to map ordinary DRAM pages with a different PAT memory type.) See also the related question "When use write-through cache policy for pages".

    Any type other than WB would be a potential performance disaster for normal code that doesn't try to batch stores up into one wide SIMD store. e.g. if you had a data structure in SHM protected by a shared mutex, it would suck if the normal accesses inside the critical section were uncacheable. Especially in the uncontended case where the same thread is repeatedly taking the same lock and reading/writing the same data.

    So there's a very good reason why it's always WB.