Search code examples
linux-kernelperformancecounterperfmemory-accessintel-pmu

Perf Result Conflict During Multiplexing


I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor (Linux 4.15.0-20-generic kernel). In a relatively idle situation, I ran the following Perf commands and their outputs are shown, below. The counters are offcore_response.all_data_rd.l3_miss.local_dram, offcore_response.all_code_rd.l3_miss.local_dram and mem_load_uops_retired.l3_miss:

sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss
^C
 Performance counter stats for 'system wide':

           229,579      offcore_response.all_data_rd.l3_miss.local_dram                                     (99.72%)
           489,151      offcore_response.all_code_rd.l3_miss.local_dram                                     (99.77%)
           110,543      mem_load_uops_retired.l3_miss                                     (99.79%)

       2.868899111 seconds time elapsed

As can be seen, event multiplexing occurred due to PMU sharing among these three events. In a similar scenario, I used the same command, except that I appended :D (mentioned in The Linux perf Event Scheduling Algorithm) to prevent multiplexing for the third event:

sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D
^C
 Performance counter stats for 'system wide':

           539,397      offcore_response.all_data_rd.l3_miss.local_dram                                     (68.71%)
           890,344      offcore_response.all_code_rd.l3_miss.local_dram                                     (68.67%)
           193,555      mem_load_uops_retired.l3_miss:D                                   

       2.853095575 seconds time elapsed

But adding :D is leading to much larger values for all counters and this seems to occur only when event multiplexing occurs. Is this output normal? Are the percentage values in parentheses valid? How could the differences in counter values be prevented?


UPDATE:

I also traced the following loop implementation:

#include <iostream>

using namespace std;

int main()
{
        for (unsigned long i = 0; i < 3 * 1e9; i++)
                ;

        return 0;
}

This time Perf was executed 7 (both with and without :D) but not using the -a option. The commands are as follows:

sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop

and

sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop

The values of the three counters are compared in the following figures: enter image description here enter image description here enter image description here For all_data_rd, the :D has almost no effect, for all_code reduces the values and for load_uops_retired causes larger values.


UPDATE 2:

Based on Peter Cordes's comments, I used a memory-intensive program as follows:

#include <iostream>
#include <cstring>

#define DIM_SIZE 10000

using namespace std;

char from [DIM_SIZE][DIM_SIZE], to [DIM_SIZE][DIM_SIZE];

int main()
{
        for (char x = 'a'; x <= 'z'; x++)
        {
                // set the 100 million element char array 'from' 
to x
                for (int i = 0; i < DIM_SIZE; i++)
                        memset(from[i], x, DIM_SIZE);

                // copy the entire char array 'from' to char array 'to'
                for (int i = 0; i < DIM_SIZE; i++)
                        memcpy(to[i], from[i], DIM_SIZE);
        }

        return 0;
} 

The following Perf commands and outputs show that the counter values are almost the same:

sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss:D ./loop

 Performance counter stats for './loop':

        19,836,745      offcore_response.all_data_rd.l3_miss.local_dram                                     (50.04%)
            47,309      offcore_response.all_code_rd.l3_miss.local_dram                                     (49.96%)
         6,556,957      mem_load_uops_retired.l3_miss:D                                   

       0.592795335 seconds time elapsed

and

    sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram,mem_load_uops_retired.l3_miss ./loop

 Performance counter stats for './loop':

        18,742,540      offcore_response.all_data_rd.l3_miss.local_dram                                     (66.64%)
            76,854      offcore_response.all_code_rd.l3_miss.local_dram                                     (66.64%)
         6,967,919      mem_load_uops_retired.l3_miss                                     (66.72%)

       0.575828303 seconds time elapsed

Solution

  • In your first two tests, you're not doing anything to generate a consistent amount of off-core memory traffic, so all you're measuring is the background "noise" on a mostly-idle system. (System-wide for the first test with -a, for the second just interrupt handlers which run some code that does miss in cache while this task is the current on a logical core.)

    You never said anything about whether those differences are repeatable across runs, e.g. with perf stat ... -r5 to re-run the same test 5 times, and print average and variance. I'd expect it's not repeatable, just random fluctuation of background stuff like network and keyboard/mouse interrupts, but if the :D version makes a consistent or statistically significant difference that might be interesting.


    In your 2nd test, the loop you used won't create any extra memory traffic; it will either be purely registers, or if compiled without optimization will load/store the same one cache line so everything will hit in L1d cache. So you still get a tiny amount of counts from it, mostly from the CRT startup and libc init code. From your main itself, probably one code fetch and zero data loads that miss in L3 since stack memory was probably already hot in cache.


    As I commented, a loop that does some memcpy or memset might make sense, or a simple known-good already written example is the STREAM benchmark, so that third test is finally something useful.

    The code-read counts in the 3rd test are also just idle noise: glibc memset and memcpy are unrolled but small enough that they fit in the uop cache, so it doesn't even need to touch L1i cache.

    We do have an interesting difference in counts for mem_load_uops_retired.l3_miss with / without :D, and off-core data-reads L3 misses.
    With the program having two different phases (memset and memcpy), different multiplexing timing will get sample different amounts of that loop. Extrapolating to the whole time from a few of those sample periods won't always be correct.

    If memset uses NT stores (or rep stosb on a CPU with ERMSB), it won't be doing any reads, just writes. Those methods use a no-RFO store protocol to avoid fetching the old value of a cache line, because they're optimized for overwriting the full line. Without that, plain stores could generate offcore_response.all_data_rd.l3_miss.local_dram I think, if RFO (Read For Ownership) count as part of all-data-read. (On my Skylake, perf list has separate events for offcore_response.demand_data_rd vs. offcore_response.demand_rfo, but "all" would include both, and prefetches (non-"demand").)

    memcpy has to read its source data, so it has a source of mem_load_uops_retired.l3_miss. It can also use NT stores or ERMSB to avoid load requests on the destination.

    Glibc memcpy / memset do use NT stores or rep movsb / rep stosb for large-enough buffers. So there will be a large difference in rate of offcore_response.all_data_rd.l3_miss.local_dram as well as mem_load_uops_retired.l3_miss between the memcpy and memset portions of your workload.

    (I wrote memcpy or memset when I was suggesting creating a simple consistent workload because that would make it uniform with time. I hadn't looked carefully at what events you were counting, that it was only load uops and offcore load requests. So memset wasn't a useful suggestion.)

    And of course code reads that miss all the way to L3 are hard to generate. In hand-written asm you could make a huge block of fully-unrolled instructions with GAS .rept 10000000 ; lea %fs:0x1234(,%rax,4), %rcx ; .endr. Or with NASM times 10000000 lea rcx, [fs: 0x1234 + rax*4]. (The segment prefix is just there to make it longer, no effect on the result.) That's a long-ish instruction (7 bytes) with no false dependency that can run at least 2 per clock on modern Intel and AMD CPUs, so should go quickly through the CPUs legacy decoders, maybe fast enough to exceed prefetch and result in lots of demand-loads for code that miss in L3.