Whiskey Lake i7-8565U / Ubuntu 18.04 / HT enabled
Consider the following code, which writes some garbage data that happened to be in registers ymm0 and ymm1 into 16 MiB of statically allocated WB memory; the whole 16 MiB write is repeated in a loop of 6400 iterations (so the page-fault impact is negligible):
;rdx = 16MiB >> 3 (number of qwords in the buffer)
    xor rcx, rcx
store_loop:
    vmovdqa [rdi + rcx*8], ymm0          ;store 32 bytes
    vmovdqa [rdi + rcx*8 + 0x20], ymm1   ;store the next 32 bytes
    add rcx, 0x08                        ;advance by 64 bytes (8 qwords)
    cmp rdx, rcx
    ja store_loop
I'm measuring RFO requests for this example, pinning the process to a core with taskset -c 3 ./bin.
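The exact perf invocation is not shown above; the counters were collected with a perf stat command along these lines (a sketch, with the event list mirroring the output below):

perf stat -e L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores \
          -e l2_rqsts.all_rfo,l2_rqsts.rfo_hit,l2_rqsts.rfo_miss \
          -e l2_rqsts.all_pf,l2_rqsts.pf_hit,l2_rqsts.pf_miss \
          -e offcore_requests.demand_rfo,offcore_response.demand_rfo.any_response \
          -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,cycles \
          taskset -c 3 ./bin

Here are the results: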
Performance counter stats for 'taskset -c 3 ./bin':
1 695 029 000 L1-dcache-load-misses # 2325,60% of all L1-dcache hits (24,93%)
72 885 527 L1-dcache-loads (24,99%)
3 411 237 144 L1-dcache-stores (25,05%)
946 374 671 l2_rqsts.all_rfo (25,11%)
451 047 123 l2_rqsts.rfo_hit (25,15%)
495 868 337 l2_rqsts.rfo_miss (25,15%)
2 367 931 179 l2_rqsts.all_pf (25,14%)
568 168 558 l2_rqsts.pf_hit (25,08%)
1 785 300 075 l2_rqsts.pf_miss (25,02%)
1 217 663 928 offcore_requests.demand_rfo (24,96%)
1 963 262 031 offcore_response.demand_rfo.any_response (24,91%)
108 536 dTLB-load-misses # 0,20% of all dTLB cache hits (24,91%)
55 540 014 dTLB-loads (24,91%)
26 310 618 dTLB-store-misses (24,91%)
3 412 849 640 dTLB-stores (24,91%)
27 265 942 916 cycles (24,91%)
6,681218065 seconds time elapsed
6,584426000 seconds user
0,096006000 seconds sys
The description of l2_rqsts.all_rfo is:

Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.

This suggests that the DCU can do some sort of RFO prefetching. It was not clear from the description of the DCU prefetcher in the Intel Optimization Manual, section 2.6.2.4:

Data cache unit (DCU) prefetcher — This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.

So my guess is that the DCU follows the "access type": if it is an RFO, then the DCU does an RFO prefetch.
All of those RFO prefetches should go to L2 along with the demand RFOs, and only some of them (l2_rqsts.rfo_miss) should go to the uncore. offcore_requests.demand_rfo counts only demand RFOs, but l2_rqsts.rfo_miss accounts for all RFOs (demand + DCU prefetch), which means the inequality offcore_requests.demand_rfo < l2_rqsts.rfo_miss should hold.
QUESTION 1: Why is l2_rqsts.rfo_miss much less than offcore_requests.demand_rfo (and even l2_rqsts.all_rfo less than offcore_requests.demand_rfo)?
I expected that offcore_requests.demand_rfo could be matched up with offcore_response.demand_rfo.any_response, so there should be approximately equal counts for those core PMU events.

QUESTION 2: Why is offcore_response.demand_rfo.any_response almost 1.5 times larger than offcore_requests.demand_rfo?
I'm guessing that the L2 streamer also does some RFO prefetches, but those should not be counted in offcore_requests.demand_rfo anyway.
UPD:
$ sudo rdmsr -p 3 0x1A4
1
L2-Streamer off
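For reference, in MSR 0x1A4 a set bit disables the corresponding prefetcher: bit 0 is the L2 hardware (streamer) prefetcher, bit 1 the L2 adjacent cache line prefetcher, bit 2 the DCU streaming prefetcher, and bit 3 the DCU IP prefetcher. The value above was presumably written with something like:

$ sudo wrmsr -p 3 0x1A4 0x1    # set bit 0: disable only the L2 streamer on core 3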
Performance counter stats for 'taskset -c 3 ./bin':
1 672 633 985 L1-dcache-load-misses # 2272,75% of all L1-dcache hits (24,96%)
73 595 056 L1-dcache-loads (25,00%)
3 409 928 481 L1-dcache-stores (25,00%)
1 593 190 436 l2_rqsts.all_rfo (25,04%)
16 582 758 l2_rqsts.rfo_hit (25,07%)
1 579 107 608 l2_rqsts.rfo_miss (25,07%)
124 294 129 l2_rqsts.all_pf (25,07%)
22 674 837 l2_rqsts.pf_hit (25,07%)
102 019 160 l2_rqsts.pf_miss (25,07%)
1 661 232 864 offcore_requests.demand_rfo (25,02%)
3 287 688 173 offcore_response.demand_rfo.any_response (24,98%)
139 247 dTLB-load-misses # 0,25% of all dTLB cache hits (24,94%)
56 823 458 dTLB-loads (24,90%)
26 343 286 dTLB-store-misses (24,90%)
3 384 264 241 dTLB-stores (24,94%)
37 782 766 410 cycles (24,94%)
9,320791474 seconds time elapsed
9,213383000 seconds user
0,099928000 seconds sys
As can be seen, offcore_requests.demand_rfo got closer to l2_rqsts.rfo_miss, but there is still some difference. In the Intel documentation of OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD I found the following:

Note: A prefetch promoted to Demand is counted from the promotion point.

So my guess is that L2 prefetches were promoted to demand and counted among the demand offcore requests. But that does not explain the difference between offcore_response.demand_rfo.any_response and offcore_requests.demand_rfo, which is now almost a factor of two:

offcore_requests.demand_rfo 1 661 232 864
vs
offcore_response.demand_rfo.any_response 3 287 688 173
UPD:
$ sudo rdmsr -p 3 0x1A4
3
All L2 prefetchers off
Performance counter stats for 'taskset -c 3 ./bin':
1 686 560 752 L1-dcache-load-misses # 2138,14% of all L1-dcache hits (23,44%)
78 879 830 L1-dcache-loads (23,48%)
3 409 552 015 L1-dcache-stores (23,53%)
1 670 187 931 l2_rqsts.all_rfo (23,56%)
15 674 l2_rqsts.rfo_hit (23,59%)
1 676 538 346 l2_rqsts.rfo_miss (23,58%)
156 206 l2_rqsts.all_pf (23,59%)
14 436 l2_rqsts.pf_hit (23,59%)
173 163 l2_rqsts.pf_miss (23,59%)
1 671 606 174 offcore_requests.demand_rfo (23,59%)
3 301 546 970 offcore_response.demand_rfo.any_response (23,59%)
140 335 dTLB-load-misses # 0,21% of all dTLB cache hits (23,57%)
68 010 546 dTLB-loads (23,53%)
26 329 766 dTLB-store-misses (23,49%)
3 429 416 286 dTLB-stores (23,45%)
39 462 328 435 cycles (23,42%)
9,699770319 seconds time elapsed
9,596304000 seconds user
0,099961000 seconds sys
Now the total number of prefetch requests to L2 (from all prefetchers) is 156 206 (l2_rqsts.all_pf).
UPD:
$ sudo rdmsr -p 3 0x1A4
7
Only the DCU IP prefetcher enabled (the other three prefetchers turned off)
Performance counter stats for 'taskset -c 3 ./bin':
1 672 643 256 L1-dcache-load-misses # 1893,36% of all L1-dcache hits (24,92%)
88 342 382 L1-dcache-loads (24,96%)
3 411 575 868 L1-dcache-stores (25,00%)
1 672 628 218 l2_rqsts.all_rfo (25,04%)
10 585 l2_rqsts.rfo_hit (25,04%)
1 684 510 576 l2_rqsts.rfo_miss (25,04%)
10 042 l2_rqsts.all_pf (25,04%)
4 368 l2_rqsts.pf_hit (25,05%)
9 135 l2_rqsts.pf_miss (25,05%)
1 684 136 160 offcore_requests.demand_rfo (25,05%)
3 316 673 543 offcore_response.demand_rfo.any_response (25,05%)
133 322 dTLB-load-misses # 0,21% of all dTLB cache hits (25,03%)
64 283 883 dTLB-loads (24,99%)
26 195 527 dTLB-store-misses (24,95%)
3 392 779 428 dTLB-stores (24,91%)
39 627 346 050 cycles (24,88%)
9,710779347 seconds time elapsed
9,610209000 seconds user
0,099981000 seconds sys
UPD:
$ sudo rdmsr -p 3 0x1A4
f
All prefetchers disabled
Performance counter stats for 'taskset -c 3 ./bin':
1 695 710 457 L1-dcache-load-misses # 2052,21% of all L1-dcache hits (23,47%)
82 628 503 L1-dcache-loads (23,47%)
3 429 579 614 L1-dcache-stores (23,47%)
1 682 110 906 l2_rqsts.all_rfo (23,51%)
12 315 l2_rqsts.rfo_hit (23,55%)
1 672 591 830 l2_rqsts.rfo_miss (23,55%)
0 l2_rqsts.all_pf (23,55%)
0 l2_rqsts.pf_hit (23,55%)
12 l2_rqsts.pf_miss (23,55%)
1 662 163 396 offcore_requests.demand_rfo (23,55%)
3 282 743 626 offcore_response.demand_rfo.any_response (23,55%)
126 739 dTLB-load-misses # 0,21% of all dTLB cache hits (23,55%)
59 790 090 dTLB-loads (23,55%)
26 373 257 dTLB-store-misses (23,55%)
3 426 860 516 dTLB-stores (23,55%)
38 282 401 051 cycles (23,51%)
9,377335173 seconds time elapsed
9,281050000 seconds user
0,096010000 seconds sys
Even though the prefetchers are disabled, perf reports 12 as pf_miss (reproducible across different runs, with different small values). This is probably a counting error. Also, l2_rqsts.rfo_miss (1 672 591 830) is slightly larger than offcore_requests.demand_rfo (1 662 163 396), which I also tend to interpret as a counting error.
Hypothesis: DCU RFO prefetches that miss L2 and go off-core are counted in offcore_requests.demand_rfo.

The hypothesis works with the L2 streamer switched off: 102 019 160 l2_rqsts.pf_miss + 1 579 107 608 l2_rqsts.rfo_miss = 1 681 126 768, versus 1 661 232 864 offcore_requests.demand_rfo.

The hypothesis also works with all prefetchers turned off: 1 684 510 576 l2_rqsts.rfo_miss versus 1 684 136 160 offcore_requests.demand_rfo.

With all prefetchers turned off, L1-dcache-load-misses is approximately equal to l2_rqsts.rfo_miss, which in turn equals offcore_requests.demand_rfo.

The thing I still have no idea about is why offcore_response.demand_rfo.any_response has a much larger value than offcore_requests.demand_rfo.
ANSWER:

It looks to me that the loop is writing to 2^18 cache lines and there is an outer loop (not shown in the question) that executes the inner loop (the one shown) 6400 times. So the expected total number of demand RFOs is 2^18 * 6400 = 1,677,721,600 and the expected number of retired store instructions is 1,677,721,600 * 2 = 3,355,443,200. The measured number of stores, L1-dcache-stores, is about 3.410 billion, which is about 55 million more than expected. This event count should be accurate, so I presume that there is other code not shown in the question that is affecting the event counts. The load event counts also indicate that there are many loads coming from somewhere, which have a significant impact on the counts of the events l2_rqsts.all_pf, l2_rqsts.pf_hit and l2_rqsts.pf_miss. I've already asked in a comment whether any other significant pieces of code are included in the measurements.
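A quick sanity check of those expected numbers (plain shell arithmetic; assumes 64-byte cache lines and the 6400-iteration outer loop):

$ echo $((16 * 1024 * 1024 / 64))   # cache lines per pass over the buffer: 2^18
262144
$ echo $((262144 * 6400))           # expected demand RFOs
1677721600
$ echo $((1677721600 * 2))          # expected retired stores (two 32-byte stores per line)
3355443200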
From the results of the first experiment, with all prefetchers enabled, it appears that l2_rqsts.rfo_hit + offcore_requests.demand_rfo add up to an amount that is nearly equal to the expected number of demand RFOs. The L2 streamer can actually prefetch RFOs, as documented in the Intel optimization manual, which explains how there can be l2_rqsts.rfo_hit events. I don't know why l2_rqsts.rfo_miss is not equal to offcore_requests.demand_rfo. I think the event offcore_requests.demand_rfo is accurate. Try disabling only the L1D prefetchers and keeping the L2 prefetchers enabled, and see whether the execution time increases. If the L1D prefetchers actually send any significant number of RFOs, there should be enough write hits in the L1D to make a difference in performance.
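Concretely, assuming the MSR 0x1A4 bit layout described above, disabling only the two L1D prefetchers while keeping both L2 prefetchers enabled would be something like:

$ sudo wrmsr -p 3 0x1A4 0xc    # set bits 2 and 3: DCU streamer and DCU IP prefetcher off, L2 prefetchers on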
The results of the second experiment, with the L2 streamer disabled, are very close to what is expected. l2_rqsts.rfo_hit is very small and l2_rqsts.all_rfo is nearly equal to offcore_requests.demand_rfo, which is equal to the expected number of demand RFOs. This provides experimental evidence that the L1D prefetchers don't prefetch RFOs. l2_rqsts.all_pf should be zero in this case since both L2 prefetchers are disabled.
In the last experiment, you've only turned off three of the four data cache prefetchers; you missed the DCU IP prefetcher. The count of l2_rqsts.all_rfo in this case is even closer to what is expected. Try disabling the DCU IP prefetcher as well and see whether l2_rqsts.rfo_hit (and maybe l2_rqsts.all_pf) becomes zero.
Erratum 058 in the specification update document for your processor says that offcore_response.demand_rfo.any_response may overcount and that offcore_requests.demand_rfo can be used instead. This explains why offcore_response.demand_rfo.any_response is larger than expected in all of the experiments, and it also suggests that offcore_requests.demand_rfo is reliable.