multithreading, profiling, perf

Frequency-based sampling of multiple threads with perf record


From a great answer on here, my understanding is that perf samples every time an event counter exceeds a threshold. When asked to sample at a certain frequency, it tries to set the event counter threshold such that the overflow occurs at the specified interval, by adjusting its estimate of the rate of events per time.
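
As a rough sanity check of that model (my own back-of-the-envelope arithmetic): with the 200 Hz target I'm using and the roughly 15.7 million cycles reported per sample below, the implied event rate is about 15.7e6 × 200 ≈ 3.1 GHz, which is a plausible core clock, so the steady-state samples fit this picture.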

I'm asking myself how this works with multiple threads, for instance with the default "cycles" event.

  • Is there a global "cycles" counter that samples whatever threads are running at the time when the overflow occurs? Or does each CPU have its own "cycles" counter that samples the thread that it is currently running, and if yes, does "each CPU" mean logical or physical cores? Or is it a counter per thread?
  • Are only cycles counted that were spent running the program?

I hope to be able to explain patterns like the following. The rate of events is obviously being recalibrated, but it's not clear to me why:

tid  timestamp        event counter
5881 187296.210979:   15736902 cycles:
5881 187296.215945:   15664720 cycles:
5881 187296.221356:   15586918 cycles:
5881 187296.227022:          1 cycles:
5881 187296.227032:          1 cycles:
5881 187296.227037:         62 cycles:
5881 187296.227043:       6902 cycles:
5881 187296.227048:     822728 cycles:
5881 187296.231842:   90947120 cycles:

I'd expect it to calculate the next counter threshold at each sample, so after the sample at 187296.215945 it sets the next sample to occur after 15586918 cycles, right? But it doesn't make sense to set the threshold after 187296.221356 to 1 cycle when the target frequency of 200 Hz had been hit steadily before. Is this interference from another thread?


Solution

  • Disclaimer

    I am not an expert on this topic, but I found this question very interesting, so I tried to come up with an answer. Take this answer with a grain of salt. Corrections are welcome; maybe Cunningham's law will get us better answers.

    What cycles maps to

    According to the perf wiki, on Intel, perf uses the UNHALTED_CORE_CYCLES event.

    From the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4, 18.2.1 Architectural Performance Monitoring Version 1

    Configuration facilities and counters are not shared between logical processors sharing a processor core.

    For AMD, the perf wiki states that the CPU_CLK_UNHALTED hardware event is used. I couldn't find this event in the current Preliminary Processor Programming Reference (PPR) for AMD Family 19h Model 01h, Revision B1 Processors Volume 2 of 2, but I found this in section 2.1.17.1:

    There are six core performance event counters per thread, six performance events counters per L3 complex and four Data Fabric performance events counters

    I would conclude that these processors support tracking the cycles event per logical core, and I would assume it is similar on ARM and other architectures (otherwise, the performance counters would be a lot less useful).

    What perf does

    Now perf has different sampling modes:

    The perf tool can be used to count events on a per-thread, per-process, per-cpu or system-wide basis. In per-thread mode, the counter only monitors the execution of a designated thread. When the thread is scheduled out, monitoring stops. When a thread migrated from one processor to another, counters are saved on the current processor and are restored on the new one.

    and

    By default, perf record operates in per-thread mode, with inherit mode enabled.

    From these sources, I would expect the following behavior from perf (a rough perf_event_open sketch follows the list):

    • When a thread starts executing on a core, the performance counter is reset
    • As the thread runs, whenever the counter overflows, a sample is taken
    • If the thread stops executing, the monitoring stops
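
    To make the per-thread mode a bit more tangible, here is a rough sketch (my own, not taken from the perf sources) of the kind of perf_event_open request that per-thread, frequency-based cycles sampling corresponds to. The thread id, the 200 Hz target and the exclude bits are illustrative choices, and error handling plus the mmap of the sample ring buffer are omitted:

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Open a "cycles" counter that follows one thread across CPUs and asks
     * the kernel to aim for ~200 samples per second (illustrative sketch). */
    static int open_cycles_counter(pid_t tid)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES; /* the "cycles" event */
        attr.freq = 1;                          /* sample_freq is a frequency, not a fixed period */
        attr.sample_freq = 200;                 /* target ~200 samples per second */
        attr.inherit = 1;                       /* follow children, as perf record does by default */
        attr.exclude_kernel = 1;                /* illustration: count only user-space cycles */
        attr.exclude_hv = 1;

        /* pid = tid, cpu = -1: monitor this thread on whatever CPU it runs on;
         * the kernel saves and restores the counter on context switches. */
        return syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
    }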

    Your questions

    So, I would conclude that

    Is there a global "cycles" counter that samples whatever threads are running at the time when the overflow occurs? Or does each CPU have its own "cycles" counter that samples the thread that it is currently running, and if yes, does "each CPU" mean logical or physical cores?

    Each logical core has its own counter.

    Or is it a counter per thread?

    It is a hardware counter on the CPU core, but perf allows you to use it as if it were per thread; if a thread of a different program gets scheduled, this should not have any impact on you. By default, perf record does not store per-thread event counts in perf.data. According to the man page, you can use -s or --stat to record them; perf report -T will then let you analyze the individual threads of your application.

    Are only cycles counted that were spent running the program?

    Yes, unless specified otherwise.
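
    As a side note, if you want to be explicit about which cycles are counted, perf lets you restrict the event with modifiers, e.g. cycles:u for user space only or cycles:k for kernel only. For example (with my_program standing in for whatever you profile):

    perf record -e cycles:u -F 200 ./my_program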

    Your output

    tid  timestamp        event counter
    5881 187296.210979:   15736902 cycles:
    5881 187296.215945:   15664720 cycles:
    5881 187296.221356:   15586918 cycles:
    5881 187296.227022:          1 cycles:
    5881 187296.227032:          1 cycles:
    5881 187296.227037:         62 cycles:
    5881 187296.227043:       6902 cycles:
    5881 187296.227048:     822728 cycles:
    5881 187296.231842:   90947120 cycles:
    

    What did you execute to get this output? Maybe I'm misinterpreting, but I would guess that the following happened:

    The points below are partly invalidated by the experiment further down.

    • You recorded with a specified target frequency. That means perf tries to adjust the current overflow threshold of the hardware counter such that you get as many cycles overflows per second as you specified (a sketch of this recalibration follows the list).
    • For the first three timestamps, threads of your program were executing on the CPU, which produced high cycles counts. perf took samples approximately every 0.005 s.
    • Then it looks like your threads no longer consumed that many CPU cycles per second. Maybe the program was waiting for IO operations most of the time?* Thus, the next sample was taken after 0.006 s and the cycles count dropped to one. perf noticed that the actual sampling frequency had dropped, so it lowered the overflow threshold to keep the sampling rate stable.
    • Then, maybe the IO operation finished and your threads started consuming more CPU cycles per second again. This caused lots of cycles events, but with the lower overflow threshold, perf now took a sample after far fewer events (0.00001 s and 0.000005 s apart for the next few samples). perf raised the overflow threshold back up during this period.
    • For the last sample, the spacing seems to be back at around 0.005 s between samples.
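
    To make that recalibration concrete, here is a simplified sketch of the idea in C. This is my own approximation, not the kernel's actual code (the real logic lives in the kernel's perf event code and smooths its estimate over time); the function name and the simple rate formula are just illustrative:

    #include <stdint.h>

    #define NSEC_PER_SEC 1000000000ULL

    /* Illustrative only: pick the next overflow threshold ("period") from the
     * observed event rate so that overflows keep occurring roughly
     * sample_freq times per second. */
    static uint64_t next_period(uint64_t events_since_last_sample,
                                uint64_t ns_since_last_sample,
                                uint64_t sample_freq)
    {
        /* observed event rate over the last interval, in events per second */
        uint64_t rate = events_since_last_sample * NSEC_PER_SEC
                        / ns_since_last_sample;

        /* events we want between two samples to hit the target frequency */
        uint64_t period = rate / sample_freq;

        /* a thread that barely ran yields a tiny rate, so the threshold
         * collapses towards 1, exactly the pattern in the output above */
        return period > 0 ? period : 1;
    }

    With this kind of logic, a stretch in which the thread hardly runs drives the threshold down to 1, and the first samples after it resumes fire almost immediately until the estimate recovers, which matches both your output and the experiment below.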

    Experiment

    I think the following might give more insights. Let's create a small, easy to understand workload:

    int main() {
        volatile unsigned int i = 0;
        while(1) {
            i++;
        }
    }
    

    gcc compiles the loop to four instructions: memory load, increment, memory store, jump. According to htop, this fully utilizes one logical core, just as you'd expect. I can simulate the program stopping its execution, as if it were waiting for IO or not being scheduled, simply by pressing Ctrl+Z in the shell to suspend it.

    Now, we run

    perf record -F 10 -p $(pidof my_workload)
    

    let it run for a moment. Then, use Ctrl+Z to suspend execution, wait for a moment, and use fg to resume it. After a few seconds, end the program.

    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0,021 MB perf.data (65 samples) ]
    

    In my case, perf record wrote 65 samples. We can use perf script to inspect the sample data that was written. I'm showing the full output, because I might otherwise accidentally cut something important (this was on an Intel i5-6200U, Ubuntu 20.04, kernel 5.4.0-72-generic):

         my_workload 831622 344935.025844:          1 cycles:  ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
         my_workload 831622 344935.025847:          1 cycles:  ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
         my_workload 831622 344935.025849:       2865 cycles:  ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
         my_workload 831622 344935.025851:   16863383 cycles:  ffffffffa12016f2 nmi_restore+0x25 ([kernel.kallsyms])
         my_workload 831622 344935.031948: 101431200645 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344935.693910:  269342015 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344935.794233:  268586235 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344935.893934:  269806654 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344935.994321:  269410272 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344936.094938:  271951524 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344936.195815:  269543174 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344936.295978:  269967653 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344936.397041:  266160499 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344936.497782:  265215251 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344936.596074:  269863048 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344936.696280:  269857624 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344936.796730:  269274440 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344936.897487:  269115742 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344936.997988:  266867300 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344937.097088:  269734778 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344937.196955:  270487956 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344937.297281:  270136625 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344937.397516:  269741183 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344943.438671:  173595673 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344943.512800:  251467821 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344943.604016:  274913168 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344943.703440:  276448269 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344943.803753:  275059394 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344943.903362:  276318281 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344944.005543:  266874454 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344944.105663:  268220344 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344944.205213:  269369912 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344944.305541:  267387036 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344944.405660:  266906130 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344944.506126:  266194513 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344944.604879:  268882693 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344944.705588:  266555089 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344944.804896:  268419478 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344944.905269:  267700541 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344945.004885:  267365839 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344945.103970:  269655126 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344945.203823:  269330033 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344945.304258:  267423569 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344945.403472:  269773962 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344945.504194:  275795565 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344945.606329:  271013012 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344945.703866:  277537908 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344945.803821:  277559915 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344945.904299:  277242574 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344946.005167:  272430392 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344946.104424:  275291909 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344946.204402:  275331204 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344946.304334:  273818645 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
         my_workload 831622 344946.403674:  275723986 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344946.456065:          1 cycles:  ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
         my_workload 831622 344946.456069:          1 cycles:  ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
         my_workload 831622 344946.456071:       2822 cycles:  ffffffffa0673594 native_write_msr+0x4 ([kernel.kallsyms])
         my_workload 831622 344946.456072:   17944993 cycles:  ffffffffa0673596 native_write_msr+0x6 ([kernel.kallsyms])
         my_workload 831622 344946.462714: 107477037825 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344946.804126:  281880047 cycles:      558f3623317e main+0x15 (/tmp/my_workload)
         my_workload 831622 344946.907508:  274093449 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344947.007473:  270795847 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344947.106277:  275006801 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344947.205757:  274972888 cycles:      558f36233178 main+0xf (/tmp/my_workload)
         my_workload 831622 344947.305405:  274436774 cycles:      558f3623317b main+0x12 (/tmp/my_workload)
    

    I think we can see two main things in this output:

    • The sample at 344937.397516 seems to be the last one before I suspended the program, and 344943.438671 seems to be the first one after it resumed. We have a slightly lower cycles count here; apart from that, it looks just like the other lines.
    • However, your pattern can be found directly after starting (which I'd say is expected) and at timestamp 344946.456065. Given the native_write_msr annotation, I think what we observe here is perf doing internal work. There was a question about what native_write_msr does; according to Peter's comment on that question, this is the kernel programming the hardware performance counters. It's still a bit strange. Maybe, after writing out the current buffer to perf.data, perf behaves just as if it had just been started?

    * As Jérôme pointed out in the comments, there can be many reasons why fewer cycles events occurred. I'd guess your program was not executing because it was sleeping or waiting for IO or memory access. It's also possible that your program simply wasn't scheduled to run by the OS during this time.

    If you're not measuring a specific workload but your general system, it may also happen that the CPU reduces its clock rate or goes into a sleep state because it has no work to do. I assumed that you did something like perf record ./my_program with my_program being a CPU-intensive workload, so I think it was unlikely that the CPU decided to sleep.