Tags: caching, optimization, cpu, x86-64, mesi

Performance of writing the same value again into a cache line


I sometimes see optimized code like this:

if (matrix[t] != 0) {
    matrix[t] = 0;
}

As opposed to just this code:

matrix[t] = 0;

I suppose this code is written this way to reduce demand on memory bandwidth inside the CPU. Is this a good optimization on a typical CPU (when the value is likely to be 0 already), and why?

What does this mean for the MESI state: is there a state transition from e.g. Shared to Modified if I write the same value back into a cache line (a write, but no actual modification)? Or would this be too complicated for the CPU to detect?

Are typical CPUs (or at least some) optimizing anything about this case?


Solution

  • AFAIK, no x86 microarchitecture attempts to commit a store from the store buffer to L1D by first reading the line while it's still in MESI Shared state and checking whether the value already matches, which would let a redundant store skip the transition to Modified.

    Such redundant stores are usually rare, and the check would only be worth the extra cache-access cycles for hot shared variables, so it doesn't make sense for a microarchitecture to do this by default. Most stores are not to shared variables, and the CPU can't tell from the store buffer which ones are.


    In cases where that's worth doing (i.e. sometimes for shared variables), you have to do it yourself with code like the if() in the question. That is exactly what that code is for, and yes it can be a win.

    It's a good idea to avoid writing a shared variable if there's a good chance some other thread has read it since you last wrote it, because committing the store has to Invalidate all other copies in order to get the writing core's line into Modified state.
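
    For example, a minimal sketch of that check-before-write pattern on a shared flag (the names and the scenario are illustrative, not from the question):

    #include <atomic>
    
    // Hypothetical shared flag that many threads may set and readers poll often.
    std::atomic<bool> done{false};
    
    void mark_done()
    {
        // A plain store would request ownership (RFO) and dirty the line on
        // every call, even when the flag is already set.  Checking first keeps
        // the line in Shared state in the common already-set case.
        if (!done.load(std::memory_order_relaxed))
            done.store(true, std::memory_order_release);
    }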

    In some cases the cost of the load + branch mispredict might be higher than the saving, especially if it doesn't predict well. (A speculative RFO might even invalidate other copies before the mispredict is detected. Of course a speculative store can't actually commit to L1D, but the read for ownership can happen AFAIK.)

    As another example, in the retry loop of a spinlock you always want to spin on a pure load (+ pause), not on xchg. Spinning on xchg or lock cmpxchg will keep hammering on that cache line and delay the code that's actually unlocking it.
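
    A minimal test-and-test-and-set sketch of that spin pattern, assuming x86 (for _mm_pause) and a simple boolean lock; the names are illustrative:

    #include <atomic>
    #include <immintrin.h>   // _mm_pause (x86)
    
    std::atomic<bool> locked{false};
    
    void lock()
    {
        // Try the atomic RMW; if the lock is taken, spin on pure loads so the
        // line can stay Shared, instead of hammering it with xchg (an RFO and
        // an Invalidate of the other copies on every attempt).
        while (locked.exchange(true, std::memory_order_acquire)) {
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();   // be polite to the other hyperthread / save power
        }
    }
    
    void unlock()
    {
        locked.store(false, std::memory_order_release);
    }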


    Intel's optimization manual even suggests this optimization in the TSX chapter: avoiding unnecessary stores reduces transaction aborts in other threads that are accessing the shared variable.

    // Example 12-1: unconditional writes dirty the cache line every time
    state = true; // updates every time
    var |= flag;
    
    // vs. checking first, so the store only happens when it actually changes something
    if (state != true) state = true;
    if (!(var & flag)) var |= flag;
    

    With TSX, transaction aborts have even higher costs than just extra waiting for MESI, so the chance of it being worth it is probably higher.