Search code examples
cstringloopsfor-loopbenchmarking

c - what is the most efficient way to copying a string?


what is the most efficient way for the cpu (benchmark way) to copy a string?

I am new to c and i am currently copying a string like this

    char a[]="copy me";
    char b[sizeof(a)];
    for (size_t i = 0; i < sizeof(a); i++) {
        b[i] = a[i];
    }
    printf("%s",b); // copy me

Here is another alternative , a while loop is a little bit faster than a for loop (of what i have heard)

 char a[]="copy me";
 char b[sizeof(a)];
 char c[sizeof(a)];
    
void copyAString (char *s, char *t)
{
    while ( (*s++ = *t++) != '\0');
};

copyAString(b,a);

printf("%s",c);

Solution

  • Don't write your own copy loops when you can use a standard function like memcpy (when the length is known) or strcpy (when it isn't).

    Modern compilers treat these as "builtin" functions, so for constant sizes can expand them to a few asm instructions instead of actually setting up a call to the library implementation, which would have to branch on the size and so on. So if you're avoiding memcpy because of the overhead of a library function call for a short copy, don't worry, there won't be one if the length is a compile-time constant.

    But even in the unknown / runtime-variable length cases, the library functions will usually be an optimized version hand-written in asm that's much faster (especially for medium to large strings) than anything you can do in pure C, especially for strcpy without undefined behaviour from reading past the end of a buffer.

    Your first block of code has a compile-time-constant size (you were able to use sizeof instead of strlen). Your copy loop will actually get recognized by modern compilers as a fixed-size copy, and (if large) turned into an actual call to memcpy, otherwise usually optimized similarly.

    It doesn't matter how you do the array indexing; optimizing compilers can see through size_t indices or pointers and make good asm for the target platform. See this and this Q&A for examples of how code actually compiles. Remember that CPUs run asm, not C directly.
    This example is too small and too simplistic to actually be usable as a benchmark, though. See Idiomatic way of performance evaluation?


    Your 2nd way is equivalent to strcpy for an implicit-length string. That's slower because it has to search for the terminating 0 byte, if it wasn't known at compile time after inlining and unrolling the loop.

    Especially if you do it by hand like this for non-constant strings; modern gcc/clang are unable to auto-vectorize loops there the program can't calculate the trip-count ahead of the first iteration. i.e. they fail at search loops like strlen and strcpy.

    If you actually just call strcpy(dst, src), the compiler will either expand it inline in some efficient way, or emit an actual call to the library function. The libc function uses hand-written asm to do it efficiently as it goes, especially on ISAs like x86 where SIMD can help. For example for x86-64, glibc's AVX2 version (https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcpy-avx2.S.html) should be able to copy 32 bytes per clock cycle for medium-sized copies with source and destination hot in cache, on mainstream CPUs like Zen2 and Skylake.

    It seems modern GCC/clang do not recognize this pattern as strcpy the way they recognize memcpy-equivalent loops, so if you want efficient copying for unknown-size C strings, you need to use actual strcpy. (Or better, stpcpy to get a pointer to the end, so you know the string length afterwards, allowing you to use explicit-length stuff instead of the next function also having to scan the string for length.)

    Writing it yourself with one char at a time will end up using byte load/store instructions, so can go at most 1 byte per clock cycle. (Or close to 2 on Ice Lake, probably bottlenecked on the 5-wide front-end for the load / macro-fused test/jz / store.) So it's a disaster for medium to large copies with runtime-variable source where the compiler can't remove the loop.

    (https://agner.org/optimize/ for performance of x86 CPUs. Other architectures are broadly similar, except for how useful SIMD is for strcpy. ISAs without x86's efficient SIMD->integer ability to branch on SIMD compare results may need to use general-purpose integer bithacks like in Why does glibc's strlen need to be so complicated to run quickly? - but note that's glibc's portable C fallback, only used on a few platforms where nobody's written hand-tuned asm.)

    @0___________ claims their unrolled char-at-a-time loop is faster than glibc strcpy for strings of 1024 chars, but that's implausible and probably the result of faulty benchmark methodology. (Like compiler optimization defeating the benchmark, or page fault overhead or lazy dynamic linking for libc strcpy.)


    Related Q&As: