performance x86 intel memory-barriers prefetch

Why does using MFENCE with store instruction block prefetching in L1 cache?

I have an object of 64 byte in size:

typedef struct _object{
  int value;
  char pad[60];
} object;

in main I am initializing array of object:

volatile object * array;
int arr_size = 1000000;
array = (object *) malloc(arr_size * sizeof(object));

for(int i=0; i < arr_size; i++){
    array[i].value = 1;
    _mm_clflush(&array[i]);
}
_mm_mfence();

Then loop again through each element. This is the loop I am counting events for:

int tmp;
for(int i=0; i < arr_size-105; i++){
    array[i].value = 2;
    //tmp = array[i].value;
     _mm_mfence();
 }

having mfence does not make any sense here but I was tying something else and accidentally found that if I have store operation, without mfence I get half million of RFO requests (measured by papi L2_RQSTS.ALL_RFO event), which means that another half million was L1 hit, prefetched before demand. However including mfence results in 1 million RFO requests, giving RFO_HITs, that means that cache line is only prefetched in L2, not in L1 cache anymore.

Besides the fact that Intel documentation somehow indicates otherwise: "data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction." I checked with load operations. without mfence I get up to 2000 L1 hit, whereas with mfence, I have up to 1 million L1 hit (measured with papi MEM_LOAD_RETIRED.L1_HIT event). The cache lines are prefetched in L1 for load instruction.

So it should not be the case that including mfence blocks prefetching. Both the store and load operations take almost the same time - without mfence 5-6 msec, with mfence 20 msec. I went through other questions regarding mfence but it's not mentioned what is expected behavior for it with prefetching and I don't see good enough reason or explanation why it would block prefetching in L1 cache with only store operations. Or I might be missing something for mfence description?

I am testing on Skylake miroarchitecture, however checked with Broadwell and got the same result.

Solution

It's not L1 prefetching that causes the counter values you see: the effect remains even if you disable the L1 prefetchers. In fact, the effect remains if you disable all prefetchers except the L2 streamer:

wrmsr -a 0x1a4 "$((2#1110))"

If you do disable the L2 streamer, however, the counts are as you'd expect: you see roughly 1,000,000 L2.RFO_MISS and L2.RFO_ALL even without the mfence.

First, it is important to note that the L2_RQSTS.RFO_* events count do not count RFO events originating from the L2 streamer. You can see the details here, but basically the umask for each of the 0x24 RFO events are:

name      umask
RFO_MISS   0x22
RFO_HIT    0x42
ALL_RFO    0xE2

Note that none of the umask values have the 0x10 bit which indicates that events which originate from the L2 streamer should be tracked.

It seems like what happens is that when the L2 streamer is active, many of the events that you might expect to be assigned to one of those events are instead "eaten" by the L2 prefetcher events instead. What likely happens is that the L2 prefetcher is running ahead of the request stream, and when the demand RFO comes in from L1, it finds a request already in progress from the L2 prefetcher. This only increments again the umask |= 0x10 version of the event (indeed I get 2,000,000 total references when including that bit), which means that RFO_MISS and RFO_HIT and RFO_ALL will miss it.

It's somewhat analogous to the "fb_hit" scenario, where L1 loads neither miss nor hit exactly, but hit an in-progress load - but the complication here is the load was initiated by the L2 prefetcher.

The mfence just slows everything down enough that the L2 prefetcher almost always has time to bring the line all the way to L2, giving an RFO_HIT count.

I don't think the L1 prefetchers are involved here at all (shown by the fact that this works the same if you turn them off): as far as I know L1 prefetchers don't interact with stores, only loads.

Here are some useful perf commands you can use to see the difference in including the "L2 streamer origin" bit. Here's w/o the L2 streamer events:

perf stat --delay=1000 -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/

and with them included:

perf stat --delay=1000 -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/

I ran these against this code (with the sleep(1) lining up with the --delay=1000 command passed to perf to exclude the init code):

#include <time.h>
#include <immintrin.h>
#include <stdio.h>
#include <unistd.h>

typedef struct _object{
  int value;
  char pad[60];
} object;

int main() {
    volatile object * array;
    int arr_size = 1000000;
    array = (object *) malloc(arr_size * sizeof(object));

    for(int i=0; i < arr_size; i++){
        array[i].value = 1;
        _mm_clflush((const void*)&array[i]);
    }
    _mm_mfence();

    sleep(1);
    // printf("Starting main loop after %zu ms\n", (size_t)clock() * 1000u / CLOCKS_PER_SEC);

    int tmp;
    for(int i=0; i < arr_size-105; i++){
        array[i].value = 2;
        //tmp = array[i].value;
        // _mm_mfence();
    }
}