linux, perf, context-switch

"vmstat" and "perf stat -a" show different numbers for context-switching


I'm trying to understand the context-switching rate on my system (running on AWS EC2) and where the switches come from. Even getting the number is confusing, because the two tools I know of that report this metric give different results. Here's the output from vmstat:

$ vmstat -w 2
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
 r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
 8  0          0     443888     492304    8632452    0    0     0     1    0    0  14  2  84  0  0
37  0          0     444820     492304    8632456    0    0     0    20 131602 155911  43  5  52  0  0
 8  0          0     445040     492304    8632460    0    0     0    42 131117 147812  46  4  50  0  0
13  0          0     446572     492304    8632464    0    0     0    34 129154 142260  49  4  46  0  0

The number is ~140k-160k/sec.
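
As a cross-check on the vmstat figure: the kernel exposes a cumulative context-switch counter as the `ctxt` line in /proc/stat, which is where vmstat's cs column comes from. The following is a minimal sketch that samples that counter twice and derives a per-second rate. The function names and the 2-second default interval are illustrative choices, not part of any tool.

```python
import time


def parse_ctxt(proc_stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def read_ctxt(path: str = "/proc/stat") -> int:
    """Read the current counter value (Linux only)."""
    with open(path) as f:
        return parse_ctxt(f.read())


def ctxt_rate(interval_s: float = 2.0) -> float:
    """Context switches per wall-clock second over one sampling interval."""
    start_count, start_time = read_ctxt(), time.monotonic()
    time.sleep(interval_s)
    end_count, end_time = read_ctxt(), time.monotonic()
    return (end_count - start_count) / (end_time - start_time)
```

Calling `ctxt_rate()` on the same host should land in roughly the same range as the cs column, since both are system-wide rates per wall-clock second.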

But perf reports something else:

$ sudo perf stat -a
 Performance counter stats for 'system wide':

    2980794.013800      cpu-clock (msec)          #   35.997 CPUs utilized
        12,335,935      context-switches          #    0.004 M/sec
         2,086,162      cpu-migrations            #    0.700 K/sec
            11,617      page-faults               #    0.004 K/sec
...

0.004 M/sec is apparently 4k/sec.

Why is there a disparity between the two tools? Am I misinterpreting something in either of them, or are their CS metrics somehow different?

FWIW, I've tried the same on a machine running a different workload, and the discrepancy there is about twice as large.

Environment:

  • AWS EC2 c5.9xlarge instance
  • Amazon Linux, kernel 4.14.94-73.73.amzn1.x86_64
  • The service runs on Docker 18.06.1-ce

Solution

  • Some recent versions of perf have a unit-scaling bug in the printing code. To check, divide the raw count (12.3M context switches) by the wall-clock duration of the run and see whether the result is sane. (Spoiler: it is, according to the OP's comment.)

    https://lore.kernel.org/patchwork/patch/1025968/

    Commit 0aa802a79469 ("perf stat: Get rid of extra clock display function") introduced the bug in mainline Linux 4.19-rc1 or so.

    The patch description explains: "Thus, perf_stat__update_shadow_stats() now saves scaled values of clock events in msecs, instead of original nsecs. But while calculating values of shadow stats we still consider clock event values in nsecs. This results in a wrong shadow stat values."

    Commit 57ddf09173c1 (Mon, 17 Dec 2018) fixed it in 5.0-rc1, and the fix was released with upstream perf version 5.0.


    Vendor kernel trees that cherry-pick commits for their stable kernels might carry the bug in other versions, or might have fixed it earlier.
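
The manual check suggested above can be sketched with the figures from the question. The perf output does not show the elapsed time, so this sketch infers it from cpu-clock and "CPUs utilized", on the assumption that "CPUs utilized" is cpu-clock divided by wall time. The variable names are illustrative.

```python
# Figures copied from the perf stat output in the question.
cpu_clock_msec = 2980794.013800   # cpu-clock, summed over all CPUs
cpus_utilized = 35.997            # "CPUs utilized" column
context_switches = 12_335_935

# Assumption: "CPUs utilized" = cpu-clock / wall time, so wall time
# can be recovered by dividing the other way.
wall_seconds = cpu_clock_msec / 1000.0 / cpus_utilized

# Rate per wall-clock second, the quantity vmstat's cs column reports.
per_wall_second = context_switches / wall_seconds

# For comparison only: rate per second of summed cpu-clock time.
per_cpu_second = context_switches / (cpu_clock_msec / 1000.0)

print(f"wall time       : {wall_seconds:.1f} s")
print(f"per wall second : {per_wall_second:,.0f} /sec")
print(f"per CPU-second  : {per_cpu_second:,.0f} /sec")
```

Under that assumption the run lasted about 83 seconds, which gives roughly 149k context switches per wall-clock second, inside the 140k-160k range that vmstat shows. Dividing by the summed cpu-clock time instead gives about 4.1k/sec, which matches the 0.004 M/sec that perf printed.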