Search code examples
cperformancegccglibc

Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?


I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.

Here's my code:

#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    char *s = calloc(1 << 20, 1);
    memset(s, 65, 1000000);
    clock_t start = clock();
    for (int i = 0; i < 128; ++i) {
        s[strlen(s)] = 'A';
    }
    clock_t end = clock();
    printf("%lld\n", (long long)(end - start));
    return 0;
}

On my machine it outputs:

$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415

Somehow, enabling optimizations causes it to execute longer.


Solution

  • Testing your code on Godbolt's Compiler Explorer provides this explanation:

    • at -O0 or without optimisations, the generated code calls the C library function strlen;
    • at -O1 the generated code uses a simple inline expansion using a rep scasb instruction;
    • at -O2 and above, the generated code uses a more elaborate inline expansion.

    Benchmarking your code repeatedly shows substantial variations from one run to another, but increasing the number of iterations shows that:

    • the -O1 code is much slower than the C library implementation: 32240 vs 3090
    • the -O2 code is faster than the -O1 but still substantially slower than the C library code: 8570 vs 3090.

    This behavior is specific to gcc and the GNU libc. The same test on OS/X with clang and Apple's Libc does not show significant differences, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.

    This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings in your benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.

    I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1 is always slower, by as much as a factor of 10 for moderately long strings, while -O2 is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.

    Here is the updated code:

    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    
    void benchmark(int repeat, int minlen, int maxlen) {
        char *s = malloc(maxlen + 1);
        memset(s, 'A', minlen);
        long long bytes = 0, calls = 0;
        clock_t clk = clock();
        for (int n = 0; n < repeat; n++) {
            for (int i = minlen; i < maxlen; ++i) {
                bytes += i + 1;
                calls += 1;
                s[i] = '\0';
                s[strlen(s)] = 'A';
            }
        }
        clk = clock() - clk;
        free(s);
        double avglen = (minlen + maxlen - 1) / 2.0;
        double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
        printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/call\n",
               avglen, ns / bytes, ns / calls);
    }
    
    int main() {
        benchmark(10000000, 0, 1);
        benchmark(1000000, 0, 10);
        benchmark(1000000, 5, 15);
        benchmark(100000, 0, 100);
        benchmark(100000, 50, 150);
        benchmark(10000, 0, 1000);
        benchmark(10000, 500, 1500);
        benchmark(1000, 0, 10000);
        benchmark(1000, 5000, 15000);
        benchmark(100, 1000000 - 50, 1000000 + 50);
        return 0;
    }
    

    Here is the output:

    chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
    average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call
    average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call
    average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call
    average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call
    average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call
    average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call
    average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call
    average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call
    average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call
    average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call
    chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
    average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call
    average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call
    average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call
    average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call
    average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call
    average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call
    average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call
    average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call
    average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call
    average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call
    chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
    average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call
    average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call
    average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call
    average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call
    average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call
    average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call
    average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call
    average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call
    average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call
    average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call