Tags: c, x86, cpu-cache, sse2, clflush

The right way to use function _mm_clflush to flush a large struct


I am starting to use functions like _mm_clflush, _mm_clflushopt, and _mm_clwb.

Say I have defined a struct variable named mystruct whose size is 256 bytes, and my cache line size is 64 bytes. Now I want to flush the cache lines that contain the mystruct variable. Which of the following is the right way to do so?

_mm_clflush(&mystruct);

or

for (int i = 0; i < sizeof(mystruct)/64; i++) {
     _mm_clflush( ((char *)&mystruct) + i*64);
}

Solution

  • The clflush CPU instruction doesn't know the size of your struct; it only flushes exactly one cache line, the one containing the byte pointed to by the pointer operand. (The C intrinsic exposes this as a const void*, but char* would also make sense, especially given the asm documentation which describes it as an 8-bit memory operand.)

    You need 4 flushes 64 bytes apart, or maybe 5 if your struct isn't alignas(64) so it could have parts in 5 different lines. (You could unconditionally flush the last byte of the struct, instead of using more complex logic to check if it's in a cache line you haven't flushed yet, depending on relative cost of clflush vs. more logic and a possible branch mispredict.)
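To make the 4-versus-5 count concrete, here is a minimal sketch that counts how many cache lines a byte range touches. The helper name `lines_touched` and the hard-coded `LINESIZE` of 64 are assumptions for illustration; neither comes from a library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINESIZE 64u  /* assumed cache-line size; check your CPU */

/* Number of cache lines touched by the `size` bytes starting at address `addr`.
 * Hypothetical helper, for illustration only. */
static size_t lines_touched(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first = addr / LINESIZE;               /* line index of the first byte */
    uintptr_t last  = (addr + size - 1) / LINESIZE;  /* line index of the last byte */
    return (size_t)(last - first + 1);
}
```

For a 256-byte object, a start address that is a multiple of 64 gives 4 lines, and any other start address gives 5.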

    Your original loop did 4 flushes of 4 adjacent bytes at the start of your struct.
    It's probably easiest to use pointer increments so the casting is not mixed up with the critical logic.

    // first attempt, a bit clunky (needs <immintrin.h> and <stdint.h>):
        const int LINESIZE = 64;
        const char *lastbyte = (const char *)(&mystruct+1) - 1;
        const char *p = (const char *)&mystruct;   // declared outside the loop: it is used again below
        for ( ; p <= lastbyte ; p+=LINESIZE) {
             _mm_clflush( p );
        }
        // if mystruct is guaranteed aligned by 64, you're done.  Otherwise not:
    
        // check if next line to maybe flush contains the last byte of the struct; if not then it was already flushed.
        // (the extra parentheses matter: == binds tighter than & in C)
        if( (((uintptr_t)p ^ (uintptr_t)lastbyte) & -LINESIZE) == 0 )
            _mm_clflush( lastbyte );
    

    x^y is 1 in bit-positions where they differ. x & -LINESIZE discards the offset-within-line bits of the address, keeping only the line-number bits. So we can see if 2 addresses are in the same cache line or not with just XOR and TEST instructions. (Or clang optimizes that to a shorter cmp instruction).
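The XOR-and-mask test can be pulled out into a small predicate. This is only a sketch: the name `same_line` is made up here, and `LINESIZE` is assumed to be 64 (the trick needs a power of 2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINESIZE 64  /* assumed cache-line size; must be a power of 2 */

static_assert((LINESIZE & (LINESIZE - 1)) == 0, "LINESIZE must be a power of 2");

/* True if both addresses fall in the same cache line.
 * Hypothetical helper, for illustration only. */
static bool same_line(const void *a, const void *b)
{
    uintptr_t diff = (uintptr_t)a ^ (uintptr_t)b;  /* 1 bits where the addresses differ */
    /* Keep only the line-number bits. The outer parentheses are required
     * because == has higher precedence than & in C. */
    return (diff & -(uintptr_t)LINESIZE) == 0;
}
```

With a 64-byte-aligned `buf`, `same_line(buf, buf + 63)` is true and `same_line(buf + 63, buf + 64)` is false.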

    Or rewrite that into a single loop, using that if logic as the termination condition:

    I used a C++ struct foo &var reference so I could follow your &var syntax but still see how it compiles for a function taking a pointer arg. Adapting to C is straightforward.

    Looping over every cache line of an arbitrary size unaligned struct

    /* I think this version is best: 
      * compact setup / small code-size
      * with no extra latency for the initial pointer
      * doesn't need to peel a final iteration
    */
    inline
    void flush_structfoo(struct foo &mystruct) {
        const int LINESIZE = 64;
        const char *p = (const char *)&mystruct;
        uintptr_t endline = ((uintptr_t)&mystruct + sizeof(mystruct) - 1) | (LINESIZE-1);
        // set the offset-within-line address bits to get the last byte 
        // of the cacheline containing the end of the struct.
    
        do {   // flush while p is in a cache line that contains any of the struct
             _mm_clflush( p );
              p += LINESIZE;
        } while(p <= (const char*)endline);
    }
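A possible C adaptation of the same loop takes a pointer and a size instead of a C++ reference. This is a sketch under assumptions: the name `flush_range` is invented here, `LINESIZE` is hard-coded to 64, and the returned flush count is only there so the behaviour can be checked. It iterates over a `uintptr_t` rather than a `char *` so the address can step past the end of the object without doing out-of-bounds pointer arithmetic.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2) */
#include <stddef.h>
#include <stdint.h>

#define LINESIZE 64  /* assumed cache-line size */

/* Flush every cache line touched by the `size` bytes at `obj`.
 * Returns the number of clflush instructions issued.
 * Hypothetical helper, for illustration only. */
static inline size_t flush_range(const void *obj, size_t size)
{
    assert(size > 0);
    uintptr_t addr = (uintptr_t)obj;
    /* Last byte of the cache line that holds the object's last byte. */
    uintptr_t endline = (addr + size - 1) | (LINESIZE - 1);
    size_t flushed = 0;
    do {   /* flush while addr is in a line that contains any of the object */
        _mm_clflush((const void *)addr);
        addr += LINESIZE;
        flushed++;
    } while (addr <= endline);
    return flushed;
}
```

A call would look like `flush_range(&mystruct, sizeof mystruct);`, which issues 4 flushes for a 64-byte-aligned 256-byte struct and 5 for a misaligned one.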
    

    With GCC10.2 -O3 for x86-64, this compiles nicely (Godbolt, where the function is named flush_v3 and sizeof(struct foo) is 256):

    flush_v3(foo&):
            lea     rax, [rdi+255]
            or      rax, 63
    .L11:
            clflush [rdi]
            add     rdi, 64
            cmp     rdi, rax
            jbe     .L11
            ret
    

    GCC doesn't unroll, and unfortunately doesn't optimize any better if you use alignas(64) struct foo{...};. You might use if (alignof(mystruct) >= 64) { ... } to check whether the special handling is needed at all, which lets GCC optimize better: when the struct is known to be 64-byte aligned, just use end = p + sizeof(mystruct); or end = (const char*)(&mystruct+1) - 1; or similar as the loop bound.

    (In C, #include <stdalign.h> to get the alignas() and alignof() macros, which match the C++ spellings, instead of writing the ISO C11 _Alignas and _Alignof keywords.)
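As a sketch of the alignof idea in C: the struct layout, the name `flush_foo`, and `LINESIZE` are all assumptions for illustration. C11 does not allow alignas on the struct declaration itself, so the first member is aligned instead, which raises the alignment of the whole struct. The alignof test is a compile-time constant, so the compiler can drop the unused branch; whether that gives better code than the general loop is not something this sketch verifies.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2) */
#include <stdalign.h>    /* alignas, alignof */
#include <stddef.h>
#include <stdint.h>

#define LINESIZE 64  /* assumed cache-line size */

/* Hypothetical 256-byte struct; aligning the first member aligns the struct. */
struct foo {
    alignas(LINESIZE) char data[256];
};

static_assert(alignof(struct foo) == LINESIZE, "struct foo should be line-aligned");

static inline void flush_foo(const struct foo *s)
{
    uintptr_t addr = (uintptr_t)s;
    uintptr_t end;
    if (alignof(struct foo) >= LINESIZE) {
        /* Aligned: the struct starts on a line boundary, so its last byte is enough. */
        end = addr + sizeof *s - 1;
    } else {
        /* General case: extend to the last byte of the line holding the last byte. */
        end = (addr + sizeof *s - 1) | (LINESIZE - 1);
    }
    do {
        _mm_clflush((const void *)addr);
        addr += LINESIZE;
    } while (addr <= end);
}
```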


    Another alternative is this, but it's clunkier and takes more setup work.

        const int LINESIZE = 64;
        uintptr_t line = (uintptr_t)&mystruct & -LINESIZE;
        uintptr_t lastline = ((uintptr_t)&mystruct + sizeof(mystruct) - 1) & -LINESIZE;
        do {               // always at least one flush; works on small structs
             _mm_clflush( (void*)line );
              line += LINESIZE;
        } while(line <= lastline);   // <= so the line holding the last byte is flushed too
    

    A struct that was 257 bytes would always touch exactly 5 cache lines, no checking needed. Or a 260-byte struct that's known to be aligned by 4. IDK if we can get GCC to optimize away the checks based on that.
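The two claims above can be checked by brute force over every possible start offset within a line. This is a sketch with made-up helper names (`lines_at_offset`, `always_touches`) and an assumed 64-byte line.

```c
#include <assert.h>
#include <stddef.h>

#define LINESIZE 64u  /* assumed cache-line size */

/* Lines touched by `size` bytes that start `offset` bytes into a cache line. */
static size_t lines_at_offset(size_t offset, size_t size)
{
    assert(size > 0 && offset < LINESIZE);
    return (offset + size - 1) / LINESIZE + 1;
}

/* Returns 1 if every start offset that is a multiple of `align`
 * touches exactly `expect` lines, else 0. */
static int always_touches(size_t size, size_t align, size_t expect)
{
    assert(align > 0);
    for (size_t off = 0; off < LINESIZE; off += align)
        if (lines_at_offset(off, size) != expect)
            return 0;
    return 1;
}
```

`always_touches(257, 1, 5)` and `always_touches(260, 4, 5)` both hold, while a 260-byte struct with only byte alignment can touch 6 lines (for example, at offset 61).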