Search code examples
c++performanceloopsmemcpystrlen

Is copying in a loop less efficient than memcpy()?


I started to study IT and I am discussing with a friend right now whether this code is inefficient or not.

// const char *pName
// char *m_pName = nullptr;

for (int i = 0; i < strlen(pName); i++)
    m_pName[i] = pName[i];

He is claiming that for example memcopy would do the same like the for loop above. I wonder if that's true, I don't believe.

If there are more efficient ways or if this is inefficient, please tell me why!

Thanks in advance!


Solution

  • I took a look at actual g++ -O3 output for your code, to see just how bad it was.

    char* can alias anything, so even the __restrict__ GNU C++ extension can't help the compiler hoist the strlen out of the loop.

    I was thinking it would be hoisted, and expecting that the major inefficiency here was just the byte-at-a-time copy loop. But no, it's really as bad as the other answers suggest. m_pName even has to be re-loaded every time, because the aliasing rules allow m_pName[i] to alias this->m_pName. The compiler can't assume that storing to m_pName[i] won't change class member variables, or the src string, or anything else.

    #include <string.h>
    class foo {
       char *__restrict__ m_pName = nullptr;
       void set_name(const char *__restrict__ pName);
       void alloc_name(size_t sz) { m_pName = new char[sz]; }
    };
    
    // g++ will only emit a non-inline copy of the function if there's a non-inline definition.
    void foo::set_name(const char * __restrict__ pName)
    {
        // char* can alias anything, including &m_pName, so the loop has to reload the pointer every time
        //char *__restrict__ dst = m_pName;  // a local avoids the reload of m_pName, but still can't hoist strlen
        #define dst m_pName
        for (unsigned int i = 0; i < strlen(pName); i++)
            dst[i] = pName[i];
    }
    

    Compiles to this asm (g++ -O3 for x86-64, SysV ABI):

    ...
    .L7:
            movzx   edx, BYTE PTR [rbp+0+rbx]      ; byte load from src.  clang uses mov al, byte ..., instead of movzx.  The difference is debatable.
            mov     rax, QWORD PTR [r12]           ; reload this->m_pName    
            mov     BYTE PTR [rax+rbx], dl         ; byte store
            add     rbx, 1
    .L3:                                 ; first iteration entry point
            mov     rdi, rbp                       ; function arg for strlen
            call    strlen
            cmp     rbx, rax
            jb      .L7               ; compare-and-branch (unsigned)
    

    Using an unsigned int loop counter introduces an extra mov ebx, ebp copy of the loop counter, which you don't get with either int i or size_t i, in both clang and gcc. Presumably they have a harder time accounting for the fact that unsigned i could produce an infinite loop.

    So obviously this is horrible:

    • a strlen call for every byte copied
    • copying one byte at a time
    • reloading m_pName every time through the loop (can be avoided by loading it into a local).

    Using strcpy avoids all these problems, because strlen is allowed to assume that it's src and dst don't overlap. Don't use strlen + memcpy unless you want to know strlen yourself. If the most efficient implementation of strcpy is to strlen + memcpy, the library function will internally do that. Otherwise, it will do something even more efficient, like glibc's hand-written SSE2 strcpy for x86-64. (There is a SSSE3 version, but it's actually slower on Intel SnB, and glibc is smart enough not to use it.) Even the SSE2 version may be unrolled more than it should be (great on microbenchmarks, but pollutes the instruction cache, uop-cache, and branch-predictor caches when used as a small part of real code). The bulk of the copying is done in 16B chunks, with 64bit, 32bit, and smaller, chunks in the startup/cleanup sections.

    Using strcpy of course also avoids bugs like forgetting to store a trailing '\0' character in the destination. If your input strings are potentially gigantic, using int for the loop counter (instead of size_t) is also a bug. Using strncpy is generally better, since you often know the size of the dest buffer, but not the size of the src.

    memcpy can be more efficient than strcpy, since rep movs is highly optimized on Intel CPUs, esp. IvB and later. However, scanning the string to find the right length first will always cost more than the difference. Use memcpy when you already know the length of your data.