Tags: assembly, x86, intel, microbenchmark, rdtsc

Why is assembler benchmarking on Intel using rdtsc giving strange answers?


A while ago, I asked a question on Stack Overflow and was shown how to execute the rdtsc instruction in C++. I recently created a benchmark function using rdtsc as follows:

inline unsigned long long rdtsc() {
  unsigned int lo, hi;
  asm volatile (
     "cpuid \n"
     "rdtsc" 
   : "=a"(lo), "=d"(hi) /* outputs */
   : "a"(0)             /* inputs */
   : "%ebx", "%ecx");     /* clobbers*/
  return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

typedef uint64_t (*FuncOneInt)(uint32_t n);
/**
     Time a function that takes an integer parameter and returns a 64-bit number.
     Since this is capable of timing in clock cycles, we won't have to do it a
     huge number of times and divide; we can literally count clocks.
     Don't forget that everything takes time, including getting into and out of the
     function.  You may want to time an empty function.  The time to do the computation
     can be computed by taking the time of the function you want minus the empty one.
 */
void clockBench(const char* msg, uint32_t n, FuncOneInt f) {
    uint64_t t0 = rdtsc();
    uint64_t r = f(n);
    uint64_t t1 = rdtsc();
    std::cout << msg << "n=" << n << "\telapsed=" << (t1-t0) << '\n';
}

I therefore assumed that if I benchmarked a function, I would have (roughly) the number of clock cycles that it took to execute. I also assumed that if I wanted to subtract the number of clock cycles that it took to get into or out of the function, I should benchmark an empty function, then write one with the desired code inside.

Here is a sample:

uint64_t empty(uint32_t n) {
    return 0;
}

uint64_t sum1Ton(uint32_t n) {
    uint64_t s = 0;
    for (int i = 1; i <= n; i++)
        s += i;
    return s;
}
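
A minimal driver is not shown in the post, but something along these lines (hypothetical; it reuses clockBench and the two functions above, and the message strings are guesses based on the output quoted below) would exercise both functions:

int main() {
    clockBench("empty ", 100, empty);        // time the empty function
    clockBench("Sum 1 to ", 100, sum1Ton);   // time the real work
    return 0;
}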

The code is compiled using

g++ -g -O2

I could understand if there were some error due to an interrupt or some other condition, but given that these routines are short and n is chosen to be small, I assumed I would see realistic numbers. But to my surprise, this is the output from two successive runs:

empty n=100 elapsed=438
Sum 1 to n=100  elapsed=887

empty n=100 elapsed=357
Sum 1 to n=100  elapsed=347

Consistently the empty function shows that it takes way more than it should.

After all, there are only a few instructions involved in getting in and out of the function. The real work is done in the loop. Never mind the fact that the variance is huge. In the second run, the empty function claims to be taking 357 clock cycles and the sum takes less, which is ridiculous.

What is happening?


Solution

  • Consistently the empty function shows that it takes way more than it should.

    You have cpuid inside the timed interval. cpuid on Intel Sandybridge-family CPUs takes 100 to 250 core clock cycles (depending on the inputs, which you neglected to set), according to Agner Fog's testing. (https://agner.org/optimize/).

    But you're not measuring core clock cycles, you're measuring RDTSC reference cycles, which can be significantly shorter than core clock cycles when the CPU is running below its reference frequency. (e.g. my Skylake i7-6700k idles at 800 MHz, but the reference clock freq is 4008 MHz, so a region that takes N core cycles at idle spans roughly 5N reference cycles.) See Get CPU cycle count? for my attempt at a canonical answer on rdtsc.

    Warm up the CPU first, or run a pause busy loop on another core to keep it pegged at max (assuming it's a desktop / laptop dual or quad core, where all core freqs are locked together.)
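
    One way to do that (a sketch, not part of the original answer; the names here are made up) is to spin with pause on another thread for the duration of the measurements, so the frequency governor keeps the package at its maximum clock:

    #include <atomic>
    #include <thread>
    #include <immintrin.h>   // _mm_pause

    std::atomic<bool> keep_spinning{true};

    void pegClock() {
        // Low-power spin: the core stays busy so clocks stay ramped up,
        // without burning as much power as a plain empty loop would.
        while (keep_spinning.load(std::memory_order_relaxed))
            _mm_pause();
    }

    // Usage: std::thread spinner(pegClock);  /* run the rdtsc benchmarks */
    //        keep_spinning = false;  spinner.join();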


    Never mind the fact that the variance is huge. In the second run, the empty function claims to be taking 357 clock cycles and the sum takes less, which is ridiculous.

    Is that effect also consistent?

    Maybe your CPU ramped up to full speed during / after printing the 3rd line of messages, making the last timed region run a lot faster? (Why does this delay-loop start to run faster after several iterations with no sleep?).
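
    A quick way to check: run the same timed regions several times back-to-back (a sketch reusing clockBench from the question, not code from the original answer). If only the first iteration or two are slow, frequency ramp-up is the likely culprit:

    void repeatBench() {
        for (int rep = 0; rep < 10; ++rep) {
            clockBench("empty ", 100, empty);
            clockBench("Sum 1 to ", 100, sum1Ton);
        }
    }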

    IDK how much effect different garbage in eax and ecx before cpuid could have. Replace cpuid with lfence to eliminate that; it's a much lower-overhead way to serialize rdtsc.
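
    A lower-overhead variant could look roughly like this (a sketch, not the poster's code; it relies on the fact that on Intel CPUs lfence doesn't execute until all earlier instructions have completed locally, and it costs far less than cpuid):

    #include <cstdint>
    #include <x86intrin.h>   // __rdtsc and _mm_lfence on GCC/clang

    inline uint64_t rdtsc_serialized() {
        _mm_lfence();             // earlier instructions must finish before rdtsc reads the TSC
        uint64_t tsc = __rdtsc();
        _mm_lfence();             // later instructions can't start before the read completes
        return tsc;
    }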