Whiskey Lake i7-8565U / Ubuntu 18.04 / HT enabled
Consider the following code, which writes some garbage data that happened to be in registers ymm0 and ymm1 into 16 MiB of statically allocated WB memory; the whole 16 MiB write is repeated in a loop of 6400 iterations (so the page-fault impact is negligible):
;rdx = 16MiB >> 3 (number of qwords in the buffer)
    xor rcx, rcx
store_loop:
    vmovdqa [rdi + rcx*8], ymm0          ;store 32 bytes
    vmovdqa [rdi + rcx*8 + 0x20], ymm1   ;store the next 32 bytes
    add rcx, 0x08                        ;advance by 64 bytes (8 qwords)
    cmp rdx, rcx
    ja store_loop
I'm measuring RFO requests for this example, pinning the process to a core with taskset -c 3 ./bin.
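The exact perf invocation is not shown above; the counters were collected with a perf stat command along these lines (a sketch, with the event list mirroring the output below):

perf stat -e L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores \
          -e l2_rqsts.all_rfo,l2_rqsts.rfo_hit,l2_rqsts.rfo_miss \
          -e l2_rqsts.all_pf,l2_rqsts.pf_hit,l2_rqsts.pf_miss \
          -e offcore_requests.demand_rfo,offcore_response.demand_rfo.any_response \
          -e dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,cycles \
          taskset -c 3 ./bin

Here are the results: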
Performance counter stats for 'taskset -c 3 ./bin':
1 695 029 000 L1-dcache-load-misses # 2325,60% of all L1-dcache hits (24,93%)
72 885 527 L1-dcache-loads (24,99%)
3 411 237 144 L1-dcache-stores (25,05%)
946 374 671 l2_rqsts.all_rfo (25,11%)
451 047 123 l2_rqsts.rfo_hit (25,15%)
495 868 337 l2_rqsts.rfo_miss (25,15%)
2 367 931 179 l2_rqsts.all_pf (25,14%)
568 168 558 l2_rqsts.pf_hit (25,08%)
1 785 300 075 l2_rqsts.pf_miss (25,02%)
1 217 663 928 offcore_requests.demand_rfo (24,96%)
1 963 262 031 offcore_response.demand_rfo.any_response (24,91%)
108 536 dTLB-load-misses # 0,20% of all dTLB cache hits (24,91%)
55 540 014 dTLB-loads (24,91%)
26 310 618 dTLB-store-misses (24,91%)
3 412 849 640 dTLB-stores (24,91%)
27 265 942 916 cycles (24,91%)
6,681218065 seconds time elapsed
6,584426000 seconds user
0,096006000 seconds sys
The description of l2_rqsts.all_rfo is:

Counts the total number of RFO (read for ownership) requests to L2 cache. L2 RFO requests include both L1D demand RFO misses as well as L1D RFO prefetches.

This suggests that the DCU can do some sort of RFO prefetching. It was not clear from the description of the DCU prefetcher in the Intel Optimization Manual, section 2.6.2.4:

Data cache unit (DCU) prefetcher — This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.

So my guess is that the DCU follows the "access type": if it is an RFO, then the DCU does an RFO prefetch.
All of those RFO prefetches should go to L2 along with the demand RFOs, and only some of them (l2_rqsts.rfo_miss) should go to the uncore. offcore_requests.demand_rfo counts only demand RFOs, but l2_rqsts.rfo_miss accounts for all RFOs (demand + DCU prefetch), which means the inequality offcore_requests.demand_rfo < l2_rqsts.rfo_miss should hold.
QUESTION 1: Why is l2_rqsts.rfo_miss much less than offcore_requests.demand_rfo (and even l2_rqsts.all_rfo less than offcore_requests.demand_rfo)?
I expected that offcore_requests.demand_rfo could be matched up with offcore_response.demand_rfo.any_response, so there should be approximately equal counts for those core PMU events.

QUESTION 2: Why is offcore_response.demand_rfo.any_response almost 1.5 times larger than offcore_requests.demand_rfo?
I'm guessing that the L2 streamer also does some RFO prefetches, but those should not be counted in offcore_requests.demand_rfo anyway.
UPD:
$ sudo rdmsr -p 3 0x1A4
1
L2-Streamer off
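For reference, in MSR 0x1A4 a set bit disables the corresponding prefetcher: bit 0 is the L2 hardware (streamer) prefetcher, bit 1 the L2 adjacent cache line prefetcher, bit 2 the DCU streaming prefetcher, and bit 3 the DCU IP prefetcher. The value above was presumably written with something like:

$ sudo wrmsr -p 3 0x1A4 0x1    # set bit 0: disable only the L2 streamer on core 3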
Performance counter stats for 'taskset -c 3 ./bin':
1 672 633 985 L1-dcache-load-misses # 2272,75% of all L1-dcache hits (24,96%)
73 595 056 L1-dcache-loads (25,00%)
3 409 928 481 L1-dcache-stores (25,00%)
1 593 190 436 l2_rqsts.all_rfo (25,04%)
16 582 758 l2_rqsts.rfo_hit (25,07%)
1 579 107 608 l2_rqsts.rfo_miss (25,07%)
124 294 129 l2_rqsts.all_pf (25,07%)
22 674 837 l2_rqsts.pf_hit (25,07%)
102 019 160 l2_rqsts.pf_miss (25,07%)
1 661 232 864 offcore_requests.demand_rfo (25,02%)
3 287 688 173 offcore_response.demand_rfo.any_response (24,98%)
139 247 dTLB-load-misses # 0,25% of all dTLB cache hits (24,94%)
56 823 458 dTLB-loads (24,90%)
26 343 286 dTLB-store-misses (24,90%)
3 384 264 241 dTLB-stores (24,94%)
37 782 766 410 cycles (24,94%)
9,320791474 seconds time elapsed
9,213383000 seconds user
0,099928000 seconds sys
As can be seen, offcore_requests.demand_rfo got closer to l2_rqsts.rfo_miss, but there is still some difference. In the Intel documentation of OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD I found the following:

Note: A prefetch promoted to Demand is counted from the promotion point.

So my guess is that L2 prefetches were promoted to demand and counted among the demand offcore requests. But that does not explain the difference between offcore_response.demand_rfo.any_response and offcore_requests.demand_rfo, which is now almost a factor of two:

offcore_requests.demand_rfo 1 661 232 864
vs
offcore_response.demand_rfo.any_response 3 287 688 173
UPD:
$ sudo rdmsr -p 3 0x1A4
3
All L2 prefetchers off
Performance counter stats for 'taskset -c 3 ./bin':
1 686 560 752 L1-dcache-load-misses # 2138,14% of all L1-dcache hits (23,44%)
78 879 830 L1-dcache-loads (23,48%)
3 409 552 015 L1-dcache-stores (23,53%)
1 670 187 931 l2_rqsts.all_rfo (23,56%)
15 674 l2_rqsts.rfo_hit (23,59%)
1 676 538 346 l2_rqsts.rfo_miss (23,58%)
156 206 l2_rqsts.all_pf (23,59%)
14 436 l2_rqsts.pf_hit (23,59%)
173 163 l2_rqsts.pf_miss (23,59%)
1 671 606 174 offcore_requests.demand_rfo (23,59%)
3 301 546 970 offcore_response.demand_rfo.any_response (23,59%)
140 335 dTLB-load-misses # 0,21% of all dTLB cache hits (23,57%)
68 010 546 dTLB-loads (23,53%)
26 329 766 dTLB-store-misses (23,49%)
3 429 416 286 dTLB-stores (23,45%)
39 462 328 435 cycles (23,42%)
9,699770319 seconds time elapsed
9,596304000 seconds user
0,099961000 seconds sys
Now the total number of prefetch requests to L2 (from all prefetchers) is 156 206 (l2_rqsts.all_pf).
UPD:
$ sudo rdmsr -p 3 0x1A4
7
Only the DCU IP prefetcher enabled (the other three prefetchers turned off)
Performance counter stats for 'taskset -c 3 ./bin':
1 672 643 256 L1-dcache-load-misses # 1893,36% of all L1-dcache hits (24,92%)
88 342 382 L1-dcache-loads (24,96%)
3 411 575 868 L1-dcache-stores (25,00%)
1 672 628 218 l2_rqsts.all_rfo (25,04%)
10 585 l2_rqsts.rfo_hit (25,04%)
1 684 510 576 l2_rqsts.rfo_miss (25,04%)
10 042 l2_rqsts.all_pf (25,04%)
4 368 l2_rqsts.pf_hit (25,05%)
9 135 l2_rqsts.pf_miss (25,05%)
1 684 136 160 offcore_requests.demand_rfo (25,05%)
3 316 673 543 offcore_response.demand_rfo.any_response (25,05%)
133 322 dTLB-load-misses # 0,21% of all dTLB cache hits (25,03%)
64 283 883 dTLB-loads (24,99%)
26 195 527 dTLB-store-misses (24,95%)
3 392 779 428 dTLB-stores (24,91%)
39 627 346 050 cycles (24,88%)
9,710779347 seconds time elapsed
9,610209000 seconds user
0,099981000 seconds sys
UPD:
$ sudo rdmsr -p 3 0x1A4
f
All prefetchers disabled
Performance counter stats for 'taskset -c 3 ./bin':
1 695 710 457 L1-dcache-load-misses # 2052,21% of all L1-dcache hits (23,47%)
82 628 503 L1-dcache-loads (23,47%)
3 429 579 614 L1-dcache-stores (23,47%)
1 682 110 906 l2_rqsts.all_rfo (23,51%)
12 315 l2_rqsts.rfo_hit (23,55%)
1 672 591 830 l2_rqsts.rfo_miss (23,55%)
0 l2_rqsts.all_pf (23,55%)
0 l2_rqsts.pf_hit (23,55%)
12 l2_rqsts.pf_miss (23,55%)
1 662 163 396 offcore_requests.demand_rfo (23,55%)
3 282 743 626 offcore_response.demand_rfo.any_response (23,55%)
126 739 dTLB-load-misses # 0,21% of all dTLB cache hits (23,55%)
59 790 090 dTLB-loads (23,55%)
26 373 257 dTLB-store-misses (23,55%)
3 426 860 516 dTLB-stores (23,55%)
38 282 401 051 cycles (23,51%)
9,377335173 seconds time elapsed
9,281050000 seconds user
0,096010000 seconds sys
Even though the prefetchers are disabled, perf reports 12 as pf_miss (reproducible across different runs, with different small values). This is probably a counting error. Also, l2_rqsts.rfo_miss (1 672 591 830) is slightly larger than offcore_requests.demand_rfo (1 662 163 396), which I also tend to interpret as a counting error.
Hypothesis: DCU RFO prefetches that miss L2 and go off-core are counted in offcore_requests.demand_rfo.

The hypothesis works with the L2 streamer switched off: 102 019 160 l2_rqsts.pf_miss + 1 579 107 608 l2_rqsts.rfo_miss = 1 681 126 768, versus 1 661 232 864 offcore_requests.demand_rfo.

The hypothesis also works with all prefetchers turned off: 1 684 510 576 l2_rqsts.rfo_miss versus 1 684 136 160 offcore_requests.demand_rfo.

With all prefetchers turned off, L1-dcache-load-misses is approximately equal to l2_rqsts.rfo_miss, which in turn equals offcore_requests.demand_rfo.

The thing I still have no idea about is why offcore_response.demand_rfo.any_response has a much larger value than offcore_requests.demand_rfo.
ANSWER:

It looks to me that the loop is writing to 2^18 cache lines and there is an outer loop (not shown in the question) that executes the inner loop (the one shown) 6400 times. So the expected total number of demand RFOs is 2^18 * 6400 = 1,677,721,600 and the expected number of retired store instructions is 1,677,721,600 * 2 = 3,355,443,200. The measured number of stores, L1-dcache-stores, is about 3.410 billion, which is about 55 million more than expected. This event count should be accurate, so I presume that there is other code not shown in the question that is affecting the event counts. The load event counts also indicate that there are many loads coming from somewhere, which have a significant impact on the counts of the events l2_rqsts.all_pf, l2_rqsts.pf_hit and l2_rqsts.pf_miss. I've already asked in a comment whether any other significant pieces of code are included in the measurements.
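A quick sanity check of those expected numbers (plain shell arithmetic; assumes 64-byte cache lines and the 6400-iteration outer loop):

$ echo $((16 * 1024 * 1024 / 64))   # cache lines per pass over the buffer: 2^18
262144
$ echo $((262144 * 6400))           # expected demand RFOs
1677721600
$ echo $((1677721600 * 2))          # expected retired stores (two 32-byte stores per line)
3355443200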
From the results of the first experiment, with all prefetchers enabled, it appears that l2_rqsts.rfo_hit + offcore_requests.demand_rfo add up to an amount that is nearly equal to the expected number of demand RFOs. The L2 streamer can actually prefetch RFOs, as documented in the Intel optimization manual, which explains how there can be l2_rqsts.rfo_hit events. I don't know why l2_rqsts.rfo_miss is not equal to offcore_requests.demand_rfo. I think the event offcore_requests.demand_rfo is accurate. Try disabling only the L1D prefetchers and keeping the L2 prefetchers enabled, and see whether the execution time increases. If the L1D prefetchers actually send any significant number of RFOs, there should be enough write hits in the L1D to make a difference in performance.
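Concretely, assuming the MSR 0x1A4 bit layout described above, disabling only the two L1D prefetchers while keeping both L2 prefetchers enabled would be something like:

$ sudo wrmsr -p 3 0x1A4 0xc    # set bits 2 and 3: DCU streamer and DCU IP prefetcher off, L2 prefetchers on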
The results of the second experiment, with the L2 streamer disabled, are very close to what is expected. l2_rqsts.rfo_hit is very small and l2_rqsts.all_rfo is nearly equal to offcore_requests.demand_rfo, which is equal to the expected number of demand RFOs. This provides experimental evidence that the L1D prefetchers don't prefetch RFOs. l2_rqsts.all_pf should be zero in this case since both L2 prefetchers are disabled.
In the last experiment, you've only turned off three of the four data cache prefetchers; you missed the DCU IP prefetcher. The count of l2_rqsts.all_rfo in this case is even closer to what is expected. Try disabling the DCU IP prefetcher as well and see whether l2_rqsts.rfo_hit (and maybe l2_rqsts.all_pf) becomes zero.
Erratum 058 in the specification update document for your processor says that offcore_response.demand_rfo.any_response may overcount and that offcore_requests.demand_rfo can be used instead. This explains why offcore_response.demand_rfo.any_response is larger than expected in all of the experiments, and it also suggests that offcore_requests.demand_rfo is reliable.