Tags: c, time, profiling, clock

Unjustified time measurements with clock_gettime() due to code outside the profiled section


I am using the C program below (compiled with gcc) to measure the computing time of a given code section. The issue is that when usleep() is called outside the section being profiled, different measurements are obtained in elapsed_time.

#include <unistd.h>
#include <stdio.h>
#include <time.h>

int main()
{
    int f = 0;
    while (f < 10)
    {
        double sum = 1.0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);

        // CODE SECTION BEING PROFILED
        for (unsigned int i = 0; i < 5e6; i++)
            sum += 0.001;
        // END OF CODE SECTION BEING PROFILED

        clock_gettime(CLOCK_MONOTONIC, &t1);


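        // Whole-seconds difference plus the nanosecond remainder; the
        // nanosecond term can be negative, which the addition handles.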
        double elapsed_time = t1.tv_sec - t0.tv_sec;
        elapsed_time += (t1.tv_nsec - t0.tv_nsec) * 1e-9;

        printf("%.2f ms\n", elapsed_time*1000 );

        f++;

        // Sleep causing fake clock_gettime measurements?
        usleep(50000);
    }

    return 0;
}

Below is the output of the program when usleep(50000) is commented out:

13.83 ms
14.30 ms
14.41 ms
13.77 ms
13.85 ms
13.88 ms
14.55 ms
13.28 ms
13.70 ms
13.56 ms

Now, the results when usleep(50000) is present:

15.37 ms
14.16 ms
36.43 ms
39.39 ms
36.38 ms
36.27 ms
34.14 ms
38.52 ms
38.18 ms
37.53 ms

I tried measuring with different clocks in clock_gettime(), and also with the C++ std::chrono::high_resolution_clock; the results are similar in all cases. I also tried different gcc flags, e.g. -O2: although much smaller times are obtained, some of the values measured with usleep() present are still ~3x higher than when it is absent.

I also found that the number of microseconds passed to usleep() affects the measured times; e.g. with 10000 us I obtain:

13.75 ms
13.54 ms
13.34 ms
14.15 ms
13.12 ms
12.79 ms
28.72 ms
25.84 ms
26.18 ms
24.53 ms

After all these tests I cannot find what is causing these "fake" time measurements in this program.


Edit:

After some further testing I realized that the same problem is caused by printf() (see the code below).

#include <unistd.h>
#include <stdio.h>
#include <time.h>

int main()
{
    double total_time = 0.0;
    int f = 0;
    while (f < 1000)
    {
        double sum = 1.0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);

        // Code section being profiled
        for (unsigned int i = 0; i < 5e6; i++)
            sum += 0.001;
        // End of Code section being profiled

        clock_gettime(CLOCK_MONOTONIC, &t1);


        double elapsed_time = t1.tv_sec - t0.tv_sec;
        elapsed_time += (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        total_time += elapsed_time;


        f++;

        // Sleep/printf causing fake clock_gettime measurements?
        printf("%.2f ms\n", elapsed_time*1000 );
        // usleep(10000);
    }
    printf("TOTAL TIME: %.2f ms\n", total_time*1000 );
    return 0;
}

Results

When I run this code compiled with g++ -g, I obtain a total time of:

  • ~13071.42 ms when the line printf("%.2f ms\n", elapsed_time*1000 ); is present (note that this line is again outside the section being profiled).
  • ~12712.22 ms when that line is commented out.

If the program is compiled using optimization (g++ -O2), I obtain a total time of:

  • ~3145.27 ms when that printf() is present.
  • ~2741.45 ms when that printf() is not present.

With printf() the measured times increase to a lesser extent than with usleep(), so usleep() doesn't seem to be the only function causing this problem... In any case, the measured time should not be affected by code placed outside the section being profiled... Am I missing something?


Edit 2:

The different clocks I tried are: CLOCK_MONOTONIC_RAW, CLOCK_PROCESS_CPUTIME_ID and CLOCK_REALTIME. All of them offer the same behaviour.

Moreover, I verified the return value of clock_gettime() to rule out failures.
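
For reference, this is the kind of check I mean (a minimal sketch; the xclock_gettime wrapper name is just illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Abort on failure so a failing clock cannot silently produce
// garbage timestamps.
static void xclock_gettime(clockid_t clk, struct timespec *ts)
{
    if (clock_gettime(clk, ts) != 0) {
        perror("clock_gettime");
        exit(EXIT_FAILURE);
    }
}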


Solution

  • Looks like you're a victim of CPU frequency scaling.

    CPU frequency scaling

    Your CPU scaling driver regulates the CPU frequency according to various factors, most importantly the current CPU load. There are various drivers available for the Linux kernel, as well as different scaling governors. Each driver supports a certain set of scaling governors and behaves differently depending on the currently active governor.

    For example, the default driver on my laptop is intel_pstate, with available governors powersave and performance. If I change the driver to acpi-cpufreq I also get ondemand, userspace and conservative.

    The main difference between governors is implied by their name: the powersave governor will try to keep the CPU at a lower frequency to save power, while the performance governor will try to keep the CPU at a high frequency to make things go faster.

    In some cases, the performance governor just fixes the CPU frequency to the maximum available (for me, this is what happens with the intel_pstate driver). In other cases, it will just try to keep it "high", but it still makes adjustments (for me, this is what happens with the acpi-cpufreq driver).

    In any case, a combination of scaling driver and governor that dynamically regulates the CPU frequency will take the CPU load into account in order to make frequency adjustments on the fly when possible. This can happen either when kernel code is entered through a syscall, or on a scheduling tick, when the kernel itself runs the scheduler to re-schedule the currently running processes.
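
    For instance, you can watch the scaling driver at work by reading the cpufreq attributes from sysfs while your program runs (paths shown for cpu0):

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    # current frequency in kHz
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq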

    What is happening?

    You most probably have a combination of scaling driver and governor that dynamically sets the CPU frequency.

    When your code runs without calling usleep(), the CPU is under heavy load and the frequency is mostly stable. However, if your code calls usleep(), the CPU load drops significantly and the scaling driver lowers the CPU frequency. By the time the process is woken up by the kernel after the sleep, the frequency is much lower, and it takes some time before the scaling driver realizes that it needs to be boosted back up. On top of this, if you keep sleeping at regular intervals, chances are that the driver will never have enough time to re-scale the frequency back up, and the code will simply run slower.

    This also applies to printf(), which needs to make a write() syscall in order to print text, and to almost any other syscall: switching between user space and kernel space slows the process down, and the scaling driver lowers the frequency accordingly. This does not happen for some syscalls like clock_gettime(), which on modern Linux is served in user space through the vDSO and does not require a context switch.
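
    You can verify this with strace (assuming your libc routes clock_gettime() through the vDSO): the trace will show a write() for every printf(), but few or no clock_gettime() entries, since vDSO calls never enter the kernel:

    strace -e trace=clock_gettime,write ./a.out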

    Here's an example, on my machine, with a dynamic governor (e.g. ondemand):

    [demo GIF: CPU frequency plotted live during Test #1 and Test #2 with a dynamic governor]

    You can clearly see the CPU frequency getting stuck at the maximum value during the first run (without usleep()), and then fluctuating up and down, without enough time to stabilize, during the second run (with usleep()). In fact, the average time for Test #2 is almost 3x that of Test #1. The same also happens with the acpi-cpufreq driver and the performance governor on my machine.

    Solution

    If you are able to set up a scaling driver/governor combination that keeps the CPU frequency fixed, you will see no difference (or at most a very small one) in timing between the two versions of your code.

    Here's another example, on my machine, with a static CPU frequency (e.g. using the userspace governor and setting a fixed speed manually):

    [demo 2 GIF: CPU frequency fixed at a constant value; both tests take about the same time]

    As you can see, both tests run in about the same time.
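
    For reference, this is roughly how such a fixed speed can be set, assuming the active driver supports the userspace governor (the value is in kHz and must lie between cpuinfo_min_freq and cpuinfo_max_freq; 2000000 is just an example):

    echo userspace | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    echo 2000000 | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed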

    If you are not able to set a different scaling governor, try changing the scaling driver of your kernel. If running Debian or Ubuntu, you may have loadable modules for different scaling drivers.

    You can see available scaling drivers and governors for your kernel by looking at your current kernel configuration:

    cat /boot/config-$(uname -r) | grep FREQ
    

    For example I can see:

    ...
    CONFIG_X86_PCC_CPUFREQ=m
    CONFIG_X86_ACPI_CPUFREQ=m
    ...
    

    Where m means "available as a module" (loadable with modprobe) and y means "built-in".

    You could then, for example, try to do:

    # Load acpi-cpufreq since we have CONFIG_X86_ACPI_CPUFREQ=m
    sudo modprobe acpi_cpufreq 
    
    # NOTE: scaling_driver is read-only, so the driver cannot be switched
    # by writing to sysfs; if the current driver (e.g. intel_pstate) is
    # built into the kernel, disable it at boot instead, e.g. by adding
    # intel_pstate=disable to the kernel command line.
    
    # Check available governors for the driver
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors
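
    # Then switch to one of the listed governors (only governors shown in
    # scaling_available_governors will be accepted), e.g.:
    echo ondemand | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor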
    

    Test code

    Here's the benchmarking code used for the above GIFs:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <stdio.h>
    #include <time.h>
    #include <math.h>
    #include <sched.h>
    #include <sys/types.h>
    
    /**
     * Marco Bonelli - 2020-03-02
     * https://stackoverflow.com/a/60481392/3889449
     *
     * Test the effect of different scaling governors on CPU frequency and
     * performance under:
     *
     *   1) Continuous heavy load.
     *   2) Intermittent and short heavy load.
     *
     * gcc -O3 scaling_governor_test.c -o test
     * ./test [N_RUNS] [N_CYCLES_PER_RUN] [TEST2_DELAY_US]
     */
    
    #define DEFAULT_RUNS   1000
    #define DEFAULT_CYCLES 1000 * 1000
    #define DEFAULT_DELAY  100 * 1000
    
    // Don't optimize this as GCC would basically trash the whole function.
    #pragma GCC push_options
    #pragma GCC optimize("O0")
    void __attribute__ ((noinline)) func(unsigned n) {
        double sum = 1.0;
    
        for (unsigned i = 0; i < n; i++)
            sum += 0.001;
    }
    #pragma GCC pop_options
    
    void warmup(unsigned runs, unsigned cycles) {
        for (unsigned n = 1; n <= runs; n++)
            func(cycles);
    }
    
    double bench(unsigned n) {
        struct timespec t0, t1;
    
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
        func(n);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
    
        return (t1.tv_sec - t0.tv_sec)*1000.0L + (t1.tv_nsec - t0.tv_nsec)/1000.0L/1000.0L;
    }
    
    void setup_affinity(void) {
        cpu_set_t set;
    
        CPU_ZERO(&set);
        CPU_SET(0, &set);
    
        if (geteuid() == 0) {
            if (sched_setaffinity(0, sizeof(set), &set) == 0)
                puts("Affinity set to CPU #0.");
            else
                perror("sched_setaffinity");
        } else {
            puts("Running as normal user, run as root to set CPU affinity.");
        }
    }
    
    int main(int argc, char **argv) {
        unsigned runs, cycles, delay;
        double cur, tot1, tot2, min, max, avg;
    
        if (argc < 2 || sscanf(argv[1], "%u", &runs) != 1 || runs < 1)
            runs = DEFAULT_RUNS;
    
        if (argc < 3 || sscanf(argv[2], "%u", &cycles) != 1 || cycles < 1)
            cycles = DEFAULT_CYCLES;
    
        if (argc < 4 || sscanf(argv[3], "%u", &delay) != 1 || delay < 1)
            delay = DEFAULT_DELAY;
    
        setup_affinity();
    
        printf("Benchmarking %u runs of %u cycles each.\n", runs, cycles);
        printf("Test #1 will proceed normally.\nTest #2 will usleep(%u) before each run.\n", delay);
        fputs("Warming up... ", stdout);
        fflush(stdout);
    
        warmup(10, cycles);
    
        puts("done.\n---");
    
        tot1 = 0;
        min = INFINITY;
        max = -INFINITY;
    
        for (unsigned n = 1; n <= runs; n++) {
            cur = bench(cycles);
    
            tot1 += cur;
            avg = tot1 / n;
            if (cur < min) min = cur;
            if (cur > max) max = cur;
    
            printf("\rTest #1: tot %-9.3f  avg %-7.3f  min %-7.3f  max %-7.3f [ms]", tot1, avg, min, max);
            fflush(stdout);
        }
    
        putchar('\n');
    
        tot2 = 0;
        min = INFINITY;
        max = -INFINITY;
    
        for (unsigned n = 1; n <= runs; n++) {
            usleep(delay);
            cur = bench(cycles);
    
            tot2 += cur;
            avg = tot2 / n;
            if (cur < min) min = cur;
            if (cur > max) max = cur;
    
            printf("\rTest #2: tot %-9.3f  avg %-7.3f  min %-7.3f  max %-7.3f [ms]", tot2, avg, min, max);
            fflush(stdout);
        }
    
        puts("\n---");
    
        if (tot1 < tot2)
            printf("Test #2 ran ~%.3fx slower than Test #1.\n", tot2/tot1);
        else if (tot1 > tot2)
            printf("Test #1 ran ~%.3fx slower than Test #2.\n", tot1/tot2);
        else
            puts("Reality is a simulation.");
    
        if (avg < 0.5)
            puts("Such low average times are not a good indicator. You should re-run the rest with different parameters.");
    
        return 0;
    }