
Write-combining: which cache line avoids being read before it is written?


Regarding non-temporal writes and write-combining techniques, I have the following code

#include <emmintrin.h>

void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c,
                             c, c, c, c,
                             c, c, c, c,
                             c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0], i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

taken from here

It is written that

To summarize, this code sequence not only avoids reading the cache line before it is written, it also avoids polluting the cache with data which might not be needed soon. This can have huge benefits in certain situations.

My question is: which cache line avoids being read before it is written? The cache line that holds the contents of the i variable, or the cache line backing the memory that p points to (which is what gets written)?


Solution

  • about: "avoids reading the cache line before it is written"

    This statement refers to the 'write allocate' policy for handling writes that miss the cache. All modern x86 processors do this. It goes like this: software writes to memory using a normal mov instruction. If that address is already cached, the cache is updated and there is no DRAM access at all. However, if the data is not in the cache, the processor reads that cache line from DRAM, and the data from the mov instruction is merged into the data in the cache. The processor will postpone writing that data back out to DRAM for as long as possible.

    The end result is counter-intuitive: software executes a write (mov) instruction, and a single DRAM read (burst) results. If this pattern repeats, the cache will eventually become full and evictions will be needed to make room for the reads. In that case, there will be a DRAM write burst of an unrelated cache line's address followed by a read of the address the software is writing.

    This explains why non-temporal stores give roughly 2X the performance for filling a large buffer: only half as many DRAM accesses occur compared to using mov to fill the buffer.
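    To make that concrete, here is a minimal sketch (not part of the original answer) contrasting the two fill strategies. The names fill_movs and fill_nt are illustrative, and the sketch assumes p is 16-byte aligned and n is a multiple of 64 (whole cache lines):

    #include <emmintrin.h>   /* SSE2: _mm_store_si128, _mm_stream_si128, _mm_sfence */
    #include <stddef.h>

    /* Plain (cached) stores: each missing 64-byte line is first read from DRAM
       (write allocate), modified in cache, and written back later on eviction,
       i.e. roughly one DRAM read burst plus one DRAM write burst per line. */
    static void fill_movs(char *p, int c, size_t n)
    {
        __m128i v = _mm_set1_epi8((char)c);
        for (size_t off = 0; off < n; off += 16)
            _mm_store_si128((__m128i *)&p[off], v);
    }

    /* Non-temporal stores: whole lines go out through the write-combining
       buffers, so the read-before-write disappears and the cache is not
       polluted; roughly one DRAM write burst per line. */
    static void fill_nt(char *p, int c, size_t n)
    {
        __m128i v = _mm_set1_epi8((char)c);
        for (size_t off = 0; off < n; off += 16)
            _mm_stream_si128((__m128i *)&p[off], v);
        _mm_sfence();  /* NT stores are weakly ordered; fence before the data is relied upon */
    }

    For a large buffer, fill_movs costs about two DRAM bursts per cache line (one read, one write), while fill_nt costs about one, which is where the roughly 2X figure comes from.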