How can I implement a fast copying function like memcpy()?

I've seen a few answers on how memcpy() can achieve faster speed than a naive byte-by-byte copy. Most of them suggest something along the lines of:

void *my_memcpy(void *dest, const void *src, size_t n) {
    uint64_t *d = dest;
    const uint64_t *s = src;
    n /= sizeof(uint64_t);

    while (n--)
        *d++ = *s++;

    return dest;
}

which to my understanding (correct me if I'm wrong) can violate the strict aliasing assumption and cause undefined behavior. To keep it simple assume that n and the alignment and size of src and dest are multiples of 8.

If my_memcpy can indeed cause undefined behavior, I want to know how memcpy manages to copy multiple bytes at a time without violating any compiler assumptions. An example of any working implementation for x64 would help.

Suggestions to use the library routine won't work. I'm actually not writing my own memcpy. I'm writing a function that can use a similar optimization but AFAIK is unavailable in the C standard.

Solution

Portably, you should copy on alignment basis, which isn't necessarily uint64_t. In theory, you should be using uint_fast8_t but in practice that one is apparently 1 byte big, 1 byte aligned on most systems. If portability isn't required, you can stick with uint64_t.

The next problem is that the pointers passed to memcpy are not necessarily pointing at an aligned address, as per the requirement of the standard function to work regardless of alignment. So you'll have to do something like this:

size_t prealign = (uintptr_t)src % _Alignof(uint64_t);
if(prealign != 0)
{
  // copy bytes up to next aligned address
}

Same for destination, and same for the end of the data.

which to my understanding (correct me if I'm wrong) can violate the strict aliasing assumption and cause undefined behavior.

Correct. So in order to copy uint64_t chunks you either have to write the code in inline assembler, or you have to disable strict aliasing in a non-standard way when compiling, such as gcc -fno-strict-aliasing.

The "real" library memcpy is treated as a special case by the compiler, as are many other such library functions. memcpy(&foo, &bar, sizeof(int)); will for example get translated to a single mov instruction, inlined in the caller code, without memcpy being called at all.

Another note about pointer aliasing is that you should restrict qualify the pointers as done with the real memcpy. This tells the compiler that it can assume that the dest and src pointers aren't the same, or that they overlap, meaning that the compiler doesn't need to add checks or overhead code for that scenario.

Amusingly, when I write the following naive copy function:

#include <stdint.h>
#include <stddef.h>

void foocpy (void* dst, const void* src, size_t n)
{
  uint8_t* u8_dst = dst;
  const uint8_t* u8_src = src;

  for(size_t i=0; i<n; i++)
  {
    u8_dst[i] = u8_src[i];
  }
}

Then the compiler gives me a tonne of fairly inefficient machine code. But if I simply add restrict to both pointers, the whole functions gets replaced with this:

foocpy:
        test    rdx, rdx
        je      .L1
        jmp     memcpy
.L1:
        ret

This again shows that the built-in memcpy is a treated as a special snowflake by the compiler.