Tags: c++, performance, static-linking, c++-chrono

Significant performance difference of std clock between different machines


While testing something else, I stumbled across behavior that I haven't managed to figure out yet.

Let's look at this snippet:

#include <iostream>
#include <chrono>

int main () {
  int i = 0;
  using namespace std::chrono_literals;

  auto const end = std::chrono::system_clock::now() + 5s;
  while (std::chrono::system_clock::now() < end) {
    ++i;
  }
  std::cout << i;
}

I've noticed that the resulting count heavily depends on the machine I execute it on.
I've compiled with gcc 7.3, gcc 8.2, and clang 6.0, using -std=c++17 -O3.

On an i7-4790 (4.17.14-arch1-1-ARCH kernel) the count is ~3e8,
but on a Xeon E5-2630 v4 (3.10.0-514.el7.x86_64) it is only ~8e6.

Now this is a difference that I would like to understand, so I've checked with perf stat -d.

On the i7:

       4999.419546      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               120      page-faults:u             #    0.024 K/sec                  
    19,605,598,394      cycles:u                  #    3.922 GHz                      (49.94%)
    33,601,884,120      instructions:u            #    1.71  insn per cycle           (62.48%)
     7,397,994,820      branches:u                # 1479.771 M/sec                    (62.53%)
            34,788      branch-misses:u           #    0.00% of all branches          (62.58%)
    10,809,601,166      L1-dcache-loads:u         # 2162.171 M/sec                    (62.41%)
            13,632      L1-dcache-load-misses:u   #    0.00% of all L1-dcache hits    (24.95%)
             3,944      LLC-loads:u               #    0.789 K/sec                    (24.95%)
             1,034      LLC-load-misses:u         #   26.22% of all LL-cache hits     (37.42%)

       5.003180401 seconds time elapsed

       4.969048000 seconds user
       0.016557000 seconds sys

On the Xeon:

       5001.000000      task-clock (msec)         #    0.999 CPUs utilized          
                42      context-switches          #    0.008 K/sec                  
                 2      cpu-migrations            #    0.000 K/sec                  
               412      page-faults               #    0.082 K/sec                  
    15,100,238,798      cycles                    #    3.019 GHz                      (50.01%)
       794,184,899      instructions              #    0.05  insn per cycle           (62.51%)
       188,083,219      branches                  #   37.609 M/sec                    (62.49%)
            85,924      branch-misses             #    0.05% of all branches          (62.51%)
       269,848,346      L1-dcache-loads           #   53.959 M/sec                    (62.49%)
           246,532      L1-dcache-load-misses     #    0.09% of all L1-dcache hits    (62.51%)
            13,327      LLC-loads                 #    0.003 M/sec                    (49.99%)
             7,417      LLC-load-misses           #   55.65% of all LL-cache hits     (50.02%)

       5.006139971 seconds time elapsed

What stands out is the low number of instructions per cycle on the Xeon, as well as the nonzero context switches, which I don't understand. However, I wasn't able to use these diagnostics to come up with an explanation.
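
From the counts above, each loop iteration costs roughly 5 s / 3e8 ≈ 17 ns on the i7 but 5 s / 8e6 ≈ 600 ns on the Xeon, and an iteration is essentially just the now() call. The per-call cost of std::chrono::system_clock::now() can be estimated directly with a micro-benchmark roughly like the following (a minimal sketch; the iteration count N is an arbitrary choice and the result depends on the machine and clocksource):

#include <chrono>
#include <iostream>

int main() {
  constexpr long N = 10'000'000;
  long long sink = 0;  // accumulate results so the calls cannot be optimized away

  auto const t0 = std::chrono::steady_clock::now();
  for (long i = 0; i < N; ++i) {
    sink += std::chrono::system_clock::now().time_since_epoch().count();
  }
  auto const t1 = std::chrono::steady_clock::now();

  auto const ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
  std::cout << "avg ns per system_clock::now() call: " << ns / double(N) << '\n';
  std::cout << "(sink: " << sink << ")\n";
}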

And to add a bit more weirdness to the problem, while trying to debug this I also compiled statically on one machine and executed it on the other.

On the Xeon the statically compiled executable gives a ~10% lower count, with no difference between compiling on the Xeon or on the i7.
Doing the same thing on the i7, the counter actually drops from 3e8 to ~2e7 for both binaries.

So in the end I'm now left with two questions:

  • Why do I see such a significant difference between the two machines?
  • Why does a statically linked executable perform worse, when I would expect the opposite?

Edit: after updating the kernel on the CentOS 7 machine to 4.18, we actually see an additional drop from ~8e6 to ~5e6.

Interestingly, perf shows quite different numbers though:

   5002.000000      task-clock:u (msec)       #    0.999 CPUs utilized          
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
           119      page-faults:u             #    0.024 K/sec                  
   409,723,790      cycles:u                  #    0.082 GHz                      (50.00%)
   392,228,592      instructions:u            #    0.96  insn per cycle           (62.51%)
   115,475,503      branches:u                #   23.086 M/sec                    (62.51%)
        26,355      branch-misses:u           #    0.02% of all branches          (62.53%)
   115,799,571      L1-dcache-loads:u         #   23.151 M/sec                    (62.51%)
        42,327      L1-dcache-load-misses:u   #    0.04% of all L1-dcache hits    (62.50%)
            88      LLC-loads:u               #    0.018 K/sec                    (49.96%)
             2      LLC-load-misses:u         #    2.27% of all LL-cache hits     (49.98%)

   5.005940327 seconds time elapsed

   0.533000000 seconds user
   4.469000000 seconds sys

It's interesting that there are no context switches anymore and instructions per cycle went up significantly, but the cycle count and therefore the effective clock are super low!
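
The fact that almost all of the elapsed time is now accounted as sys rather than user suggests that every clock read ends up in a real system call instead of the vDSO fast path that a tsc clocksource would allow. A Linux-specific way to compare the two paths is a sketch like this (the libc wrapper may be served from the vDSO, while the raw syscall always enters the kernel; the iteration count is arbitrary):

#include <chrono>
#include <cstdio>
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

// Time n invocations of f and return the average cost per call in nanoseconds.
template <typename F>
static double avg_ns_per_call(F&& f, long n) {
  auto const t0 = std::chrono::steady_clock::now();
  for (long i = 0; i < n; ++i) f();
  auto const t1 = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
         / static_cast<double>(n);
}

int main() {
  constexpr long n = 5'000'000;
  timespec ts{};

  // libc wrapper: can use the vDSO and avoid a kernel entry
  double wrapper = avg_ns_per_call([&] { clock_gettime(CLOCK_REALTIME, &ts); }, n);
  // raw syscall: always a kernel entry
  double raw = avg_ns_per_call([&] { syscall(SYS_clock_gettime, CLOCK_REALTIME, &ts); }, n);

  std::printf("libc clock_gettime : %.1f ns/call\n", wrapper);
  std::printf("raw syscall        : %.1f ns/call\n", raw);
}

If the two numbers come out close, the wrapper is not using the fast path, which would be consistent with the clocksource issue described in the solution below.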


Solution

  • I've been able to reproduce the respective measurements on the two machines, thanks to @Imran's comment above. (Posting this answer to close the question; if Imran posts one, I'm happy to accept his instead.)

    It is indeed related to the available clocksource. The Xeon, unfortunately, had the notsc flag in its kernel parameters, which is why the tsc clocksource wasn't available and couldn't be selected.

    Thus, for anyone running into this problem (a small sketch for reading these entries programmatically follows after the list):
    1. Check your current clocksource in /sys/devices/system/clocksource/clocksource0/current_clocksource
    2. Check the available clocksources in /sys/devices/system/clocksource/clocksource0/available_clocksource
    3. If you can't find tsc, check dmesg | grep tsc and your kernel parameters for notsc
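
    For completeness, the first two checks can also be scripted; here is a minimal Linux-only C++ sketch that simply reads the sysfs entries from the steps above:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Read the first line of a sysfs file; returns an empty string if unreadable.
    static std::string read_line(const std::string& path) {
      std::ifstream f(path);
      std::string line;
      std::getline(f, line);
      return line;
    }

    int main() {
      const std::string base = "/sys/devices/system/clocksource/clocksource0/";
      std::cout << "current clocksource   : " << read_line(base + "current_clocksource") << '\n';
      std::cout << "available clocksources: " << read_line(base + "available_clocksource") << '\n';
    }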