Tags: c, linux, x86, atomic, memory-barriers

How to copy multiple data elements between CPUs using cacheline atomicity?


I'm trying to implement an atomic copy of multiple data elements between CPUs. I packed multiple elements of data into a single cacheline to manipulate them atomically, so I wrote the following code.

In this code (compiled with -O3), I aligned a global struct to a single cacheline. On one CPU I set the elements and then issue a store barrier, which is meant to make them globally visible to the other CPU.

Meanwhile, on the other CPU, I used a load barrier to access the cacheline atomically. My expectation was that the reader (consumer) CPU would bring the cache line of data into its own cache hierarchy (L1, L2, etc.). Since I do not use a load barrier again until the next read, the elements of the data should all be the same, but it does not work as expected: I can't keep the cacheline atomicity in this code. The writer CPU seems to put elements into the cacheline piece by piece. How is that possible?

#define _GNU_SOURCE             /* for pthread_setaffinity_np() and cpu_set_t */
#include <emmintrin.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "common.h"             /* fatal_relog_if(), fatal_elog_if() */

#define CACHE_LINE_SIZE             64

struct levels {
    uint32_t x1;
    uint32_t x2;
    uint32_t x3;
    uint32_t x4;
    uint32_t x5;
    uint32_t x6;
    uint32_t x7;
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct levels g_shared;

void *worker_loop(void *param)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(15, &cpuset);

    pthread_t thread = pthread_self();

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    fatal_relog_if(status != 0, status);

    struct levels shared;
    while (1) {

        _mm_lfence();
        shared = g_shared;

        if (shared.x1 != shared.x7) {
            printf("%u %u %u %u %u %u %u\n",
                    shared.x1, shared.x2, shared.x3, shared.x4, shared.x5, shared.x6, shared.x7);
            exit(EXIT_FAILURE);
        }
    }

    return NULL;
}

int main(int argc, char *argv[])
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(16, &cpuset);

    pthread_t thread = pthread_self();

    memset(&g_shared, 0, sizeof(g_shared));

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    fatal_relog_if(status != 0, status);

    pthread_t worker;
    int istatus = pthread_create(&worker, NULL, worker_loop, NULL);
    fatal_elog_if(istatus != 0);

    uint32_t val = 0;
    while (1) {
        g_shared.x1 = val;
        g_shared.x2 = val;
        g_shared.x3 = val;
        g_shared.x4 = val;
        g_shared.x5 = val;
        g_shared.x6 = val;
        g_shared.x7 = val;

        _mm_sfence();
        // _mm_clflush(&g_shared);

        val++;
    }

    return EXIT_SUCCESS;
}

The output looks like this:

3782063 3782063 3782062 3782062 3782062 3782062 3782062

UPDATE 1

I updated the code as below using AVX512, but the problem is still there.

#define _GNU_SOURCE             /* for pthread_setaffinity_np() and cpu_set_t */
#include <immintrin.h>          /* AVX-512 intrinsics (also pulls in emmintrin.h) */
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "common.h"             /* fatal_relog_if(), fatal_elog_if() */

#define CACHE_LINE_SIZE             64

/**
 * Copy 64 bytes from one location to another,
 * locations should not overlap.
 */
static inline __attribute__((always_inline)) void
mov64(uint8_t *dst, const uint8_t *src)
{
        __m512i zmm0;

        zmm0 = _mm512_load_si512((const void *)src);
        _mm512_store_si512((void *)dst, zmm0);
}

struct levels {
    uint32_t x1;
    uint32_t x2;
    uint32_t x3;
    uint32_t x4;
    uint32_t x5;
    uint32_t x6;
    uint32_t x7;
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct levels g_shared;

void *worker_loop(void *param)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(15, &cpuset);

    pthread_t thread = pthread_self();

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    fatal_relog_if(status != 0, status);

    struct levels shared;
    while (1) {
        mov64((uint8_t *)&shared, (uint8_t *)&g_shared);
        // shared = g_shared;

        if (shared.x1 != shared.x7) {
            printf("%u %u %u %u %u %u %u\n",
                    shared.x1, shared.x2, shared.x3, shared.x4, shared.x5, shared.x6, shared.x7);
            exit(EXIT_FAILURE);
        } else {
            printf("%u %u\n", shared.x1, shared.x7);
        }
    }

    return NULL;
}

int main(int argc, char *argv[])
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(16, &cpuset);

    pthread_t thread = pthread_self();

    memset(&g_shared, 0, sizeof(g_shared));

    int status = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    fatal_relog_if(status != 0, status);

    pthread_t worker;
    int istatus = pthread_create(&worker, NULL, worker_loop, NULL);
    fatal_elog_if(istatus != 0);

    uint32_t val = 0;
    while (1) {
        g_shared.x1 = val;
        g_shared.x2 = val;
        g_shared.x3 = val;
        g_shared.x4 = val;
        g_shared.x5 = val;
        g_shared.x6 = val;
        g_shared.x7 = val;

        _mm_sfence();
        // _mm_clflush(&g_shared);

        val++;
    }

    return EXIT_SUCCESS;
}

Solution

  • I used a load barrier to access the cacheline atomically

    No, barriers do not create atomicity. They only order your own operations; they do not stop operations from other threads from appearing between two of your own.

    Non-atomicity happens when another thread's store becomes visible between two of our loads. lfence does nothing to stop that.

    lfence here is pointless; it just makes the CPU running this thread stall until it drains its ROB/RS before executing the loads. (lfence serializes execution, but has no effect on memory ordering unless you're using NT loads from WC memory e.g. video RAM).


    Your options are:

    • Recognize that this is an X-Y problem and do something that doesn't require 64-byte atomic loads/stores, e.g. atomically update a pointer to non-atomic data. The general case of that is RCU, or perhaps a lock-free queue using a circular buffer. (A minimal C11 pointer-publishing sketch is included at the end of this answer.)

    Or

    • Use a software lock to get logical atomicity (like _Atomic struct levels g_shared; with C11) for threads that agree to cooperate by respecting the lock.

      A SeqLock might be a good choice for this data if it's read more often than it changes, or especially with a single writer and multiple readers. Readers retry when tearing may have been possible; they check a sequence number before/after the read, using sufficient memory-ordering. See Implementing 64 bit atomic counter with 32 bit atomics for a C++11 implementation; C11 is easier because C allows assignment from a volatile struct to a non-volatile temporary. A minimal single-writer seqlock sketch in C11 is included at the end of this answer.

    Or hardware-supported 64-byte atomicity:

    • Intel transactional memory (TSX), available on some CPUs. This would even let you do an atomic RMW on it, or atomically read from one location and write to another. But more complex transactions are more likely to abort. Putting 4x 16-byte or 2x 32-byte loads into a transaction should hopefully not abort very often even under contention. Same for grouping stores into a separate transaction. (Hopefully the compiler is smart enough to end the transaction with the loaded data still in registers, so it doesn't have to be atomically stored to a local on the stack, too.)

      There are GNU C/C++ extensions for transactional memory: https://gcc.gnu.org/wiki/TransactionalMemory. A small sketch using the lower-level RTM intrinsics is also included at the end of this answer.

    • AVX512 (allowing a full-cache-line load or store) on a CPU which happens to implement it in a way that makes aligned 64-byte loads/stores atomic. There's no on-paper guarantee that anything wider than an 8-byte load/store is ever atomic on x86, except for lock cmpxchg16b and movdir64b.

      In practice we're fairly sure that modern Intel CPUs like Skylake transfer whole cache-lines atomically between cores, unlike AMD. And we know that on Intel (not AMD) a vector load or store that doesn't cross a cache-line boundary does make a single access to L1d cache, transferring all the bits in the same clock cycle. So an aligned vmovaps zmm, [mem] on Skylake-avx512 should in practice be atomic, unless you have an exotic chipset that glues many sockets together in a way that creates tearing. (Multi-socket K10 vs. single-socket K10 is a good cautionary tale: Why is integer assignment on a naturally aligned variable atomic on x86?)

    • MOVDIR64B - only atomic for the store part, and only supported on Intel Tremont (next-gen Goldmont successor). This still doesn't give you a way to do a 64-byte atomic load. Also it's a cache-bypassing store so not good for inter-core communication latency. I think the use-case is generating a full-size PCIe transaction.

    See also SSE instructions: which CPUs can do atomic 16B memory operations? re: lack of atomicity guarantees for SIMD load/store. CPU vendors have for some reason not chosen to provide any written guarantees or ways to detect when SIMD loads/stores will be atomic, even though testing has shown that they are on many systems (when you don't cross a cache-line boundary.)
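
    To illustrate the pointer-publishing option above, here is a minimal C11 sketch (the names publish, read_snapshot and g_current are made up for this example, and reclamation of old snapshots is deliberately ignored; handling that safely is exactly what RCU is for):

/* Sketch only: atomically publish a pointer to a non-atomic struct.
 * Old snapshots are leaked here; freeing them safely while readers may
 * still hold them is the problem that RCU / grace periods solve. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct levels {
    uint32_t x1, x2, x3, x4, x5, x6, x7;
};

static _Atomic(struct levels *) g_current;   /* the pointer is atomic, the payload is not */

void publish(uint32_t val)                   /* writer thread */
{
    struct levels *snap = malloc(sizeof *snap);   /* error handling omitted */
    snap->x1 = snap->x2 = snap->x3 = snap->x4 = val;
    snap->x5 = snap->x6 = snap->x7 = val;
    /* Release store: all the plain stores above become visible
     * before the new pointer does. */
    atomic_store_explicit(&g_current, snap, memory_order_release);
}

const struct levels *read_snapshot(void)     /* reader thread */
{
    /* Acquire load pairs with the release store: once the new pointer is
     * seen, the pointed-to struct is seen fully initialized and consistent.
     * May return NULL before the first publish(). */
    return atomic_load_explicit(&g_current, memory_order_acquire);
}

    The reader gets an internally consistent snapshot without needing any wide atomic load; the cost is an allocation (or a pool slot) per update.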
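
    For the seqlock option, here is a minimal single-writer sketch in C11 (illustrative only, with made-up names; a production version should follow the linked counter Q&A). The payload is volatile, matching the note above about copying a volatile struct to a non-volatile temporary:

/* Minimal single-writer seqlock sketch (C11).  Readers retry whenever the
 * sequence number is odd (a write is in progress) or changed across the
 * copy, i.e. whenever tearing was possible. */
#include <stdatomic.h>
#include <stdint.h>

struct levels { uint32_t x1, x2, x3, x4, x5, x6, x7; };

static volatile struct levels g_payload;     /* plain data, copied as a whole struct */
static _Atomic uint32_t g_seq;               /* even = stable, odd = write in progress */

void seqlock_write(const struct levels *src) /* single writer only */
{
    uint32_t seq = atomic_load_explicit(&g_seq, memory_order_relaxed);
    atomic_store_explicit(&g_seq, seq + 1, memory_order_relaxed);  /* mark dirty (odd) */
    atomic_thread_fence(memory_order_release);  /* keep the payload stores after seq+1 */
    g_payload = *src;                           /* assignment to a volatile struct */
    atomic_store_explicit(&g_seq, seq + 2, memory_order_release);  /* mark clean (even) */
}

void seqlock_read(struct levels *dst)
{
    uint32_t s0, s1;
    do {
        s0 = atomic_load_explicit(&g_seq, memory_order_acquire);
        *dst = g_payload;                           /* may tear; detected below */
        atomic_thread_fence(memory_order_acquire);  /* keep the payload loads before the re-check */
        s1 = atomic_load_explicit(&g_seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);                 /* retry if a write overlapped the copy */
}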
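
    And for the TSX option, a sketch using the lower-level RTM intrinsics instead of the GNU TM extension (compile with -mrtm and check CPUID for RTM support at runtime; transactions can always abort, so a real implementation needs a non-transactional fallback such as a lock):

/* Sketch: 64-byte atomic read via an RTM transaction (Intel TSX). */
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

struct levels { uint32_t x1, x2, x3, x4, x5, x6, x7; };
struct levels g_shared __attribute__((aligned(64)));   /* the shared object from the question */

bool read_atomic_tsx(struct levels *dst)
{
    for (int attempt = 0; attempt < 10; attempt++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            *dst = g_shared;            /* whole-struct copy inside the transaction */
            _xend();                    /* commit: the copy was atomic w.r.t. other cores */
            return true;
        }
        /* Aborted (conflict, capacity, interrupt, ...): just retry. */
    }
    return false;                       /* caller must fall back, e.g. take a lock */
}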