gcc assembly clang compiler-optimization memcpy

clang vs gcc for copying 3 bytes on x86_64 - number of mov's

What should optimized compiled code for copying 3 bytes from one place to another, say, using memcpy(,,3) look like, in terms of assembly instructions?

Consider the following program:

#include <string.h>
int main() {
  int* p = (int*) 0x10;
  int x = 0;
  memcpy(&x, p, 4);
  x = x * (x > 1 ? 2 : 3);
  memcpy(p, &x, 4);  
  return 0;
}

it's a bit contrived a will cause a segmentation violation, but I need those instructions so that compiling with -O3 doesn't make all of it go away. When I compile this (GodBolt, GCC 6.3 -O3), I get:

main:
        mov     edx, DWORD PTR ds:16
        xor     eax, eax
        cmp     edx, 1
        setle   al
        add     eax, 2
        imul    eax, edx
        mov     DWORD PTR ds:16, eax
        xor     eax, eax
        ret

great - a single mov of a DWORD (= 4 bytes) from memory to a register. Nice and optimized. Now let's change the memcpy(&x, p1, 4) into memcpy(&x, p1, 3)? The compilation result becomes:

main:
        mov     DWORD PTR [rsp-4], 0
        movzx   eax, WORD PTR ds:16
        mov     WORD PTR [rsp-4], ax
        movzx   eax, BYTE PTR ds:18
        mov     BYTE PTR [rsp-2], al
        mov     edx, DWORD PTR [rsp-4]
        xor     eax, eax
        cmp     edx, 1
        setle   al
        add     eax, 2
        imul    eax, edx
        mov     DWORD PTR ds:16, eax
        xor     eax, eax
        ret

I'm not much of an exprt on Intel X86_64 assembly (read: I can't even read it properly when it's complicated), so - I don't quite get this. I mean, I get what's happening in the first 6 instructions and why so many of them are necessary. Why aren't two moves sufficient? A mov WORD PTR int al and a mov BYTE PTR into ah?

... so, I came here to ask. As I was writing the question I noticed GodBolt also has clang as an option. Well, clang (3.9.0 -O3) does this:

main:                                   # @main
        movzx   eax, byte ptr [18]
        shl     eax, 16
        movzx   ecx, word ptr [16]
        or      ecx, eax
        cmp     ecx, 2
        sbb     eax, eax
        and     eax, 1
        or      eax, 2
        imul    eax, ecx
        mov     dword ptr [16], eax
        xor     eax, eax
        ret

which looks more like what I expected. What explains the difference?

Notes:

It's the same behavior essentially if I don't initialize x = 0.
Other GCC versions do about the same thing as GCC 6.3, but GCC 7 is down to 5 instead of 6 mov's.
Other versions of clang (starting from 3.4) do about the same thing.

The behavior is similar if we forego memcpy'ing for the following:

#include <string.h>

typedef struct {
  unsigned char data[3];
}  uint24_t;

int main() {
  uint24_t* p = (uint24_t*) 0x30;
  int x = 0;
  *((uint24_t*) &x) = *p;
  x = x * (x > 1 ? 2 : 3);
  *p = *((uint24_t*) &x);
  return 0;
}

If you want to contrast with what happens when the relevant code is in a function, have a look at this or the uint24_t struct version (GodBolt). Then have a look at what happens for 4-byte values.

Solution

The size three is an ugly size and compilers are not perfect.

The compiler cannot generate an access to a memory location you haven't requested, so it needs two moves.

While it seems trivial for you, remember that you asked for memcpy(&x, p, 4); which is a copy from memory to memory.
Evidently GCC and the older versions of Clang are not smart enough to figure it out there is no reason to pass for a temporary in memory.

What GCC is doing with the first six instructions is basically constructing a DWORD at [rsp-4] with the three bytes, as you requested

mov     DWORD PTR [rsp-4], 0              ;DWORD is 0

movzx   eax, WORD PTR ds:16               ;EAX = byte 0 and byte 1
mov     WORD PTR [rsp-4], ax              ;DWORD has byte 0 and byte 1

movzx   eax, BYTE PTR ds:18               ;EAX = byte 2
mov     BYTE PTR [rsp-2], al              ;DWORD has byte 0, byte 1 and byte 2

mov     edx, DWORD PTR [rsp-4]            ;As previous from henceon

It is using movzx eax, ... to prevent a partial register stall.

The compilers did a great job already by eliding the call to memcpy and as you said the example is "a bit contrived" to follow, even for a human. The memcpy optimisations must work for any size, including those that cannot fit a register. It's not easy to get it right every time.

Considering that L1 access latencies have been lowered considerably in the recent architectures and that [rsp-4] is very likely to be in the cache, I'm not sure it's worth messing with the optimisation code in the GCC source.
It is surely worth filing a bug for a missed optimisation and see what the developers has to say.