What should optimized compiled code for copying 3 bytes from one place to another, say, using memcpy(,,3)
look like, in terms of assembly instructions?
Consider the following program:
#include <string.h>
int main() {
int* p = (int*) 0x10;
int x = 0;
memcpy(&x, p, 4);
x = x * (x > 1 ? 2 : 3);
memcpy(p, &x, 4);
return 0;
}
It's a bit contrived and will cause a segmentation violation, but I need those instructions so that compiling with -O3
doesn't make all of it go away. When I compile this (GodBolt, GCC 6.3 -O3), I get:
main:
mov edx, DWORD PTR ds:16
xor eax, eax
cmp edx, 1
setle al
add eax, 2
imul eax, edx
mov DWORD PTR ds:16, eax
xor eax, eax
ret
Great: a single mov of a DWORD (= 4 bytes) from memory to a register. Nice and optimized. Now what if we change the memcpy(&x, p, 4) into memcpy(&x, p, 3)? The compilation result becomes:
main:
mov DWORD PTR [rsp-4], 0
movzx eax, WORD PTR ds:16
mov WORD PTR [rsp-4], ax
movzx eax, BYTE PTR ds:18
mov BYTE PTR [rsp-2], al
mov edx, DWORD PTR [rsp-4]
xor eax, eax
cmp edx, 1
setle al
add eax, 2
imul eax, edx
mov DWORD PTR ds:16, eax
xor eax, eax
ret
I'm not much of an expert on Intel x86_64 assembly (read: I can't even read it properly when it's complicated), so I don't quite get this. I mean, I get what's happening in the first 6 instructions, but not why so many of them are necessary. Why aren't two moves sufficient? A mov WORD PTR into al and a mov BYTE PTR into ah?
... so, I came here to ask. As I was writing the question I noticed GodBolt also has clang as an option. Well, clang (3.9.0 -O3) does this:
main: # @main
movzx eax, byte ptr [18]
shl eax, 16
movzx ecx, word ptr [16]
or ecx, eax
cmp ecx, 2
sbb eax, eax
and eax, 1
or eax, 2
imul eax, ecx
mov dword ptr [16], eax
xor eax, eax
ret
which looks more like what I expected. What explains the difference?
Notes:
x = 0
.mov
's.
The behavior is similar if we forgo memcpy'ing and use the following instead:
#include <string.h>
typedef struct {
unsigned char data[3];
} uint24_t;
int main() {
uint24_t* p = (uint24_t*) 0x30;
int x = 0;
*((uint24_t*) &x) = *p;
x = x * (x > 1 ? 2 : 3);
*p = *((uint24_t*) &x);
return 0;
}
If you want to contrast with what happens when the relevant code is in a function, have a look at this or the uint24_t struct version (GodBolt). Then have a look at what happens for 4-byte values.
The size three is an ugly size and compilers are not perfect.
The compiler cannot generate an access to a memory location you haven't requested, so it needs two moves.
While it seems trivial to you, remember that you asked for memcpy(&x, p, 4); which is a copy from memory to memory.
Evidently GCC and the older versions of Clang are not smart enough to figure out that there is no reason to go through a temporary in memory.
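In C terms, the two code generation strategies shown in the question can be sketched roughly as follows. The function names are made up for illustration, and the second function assumes a little-endian target such as x86.

```c
#include <stdint.h>
#include <string.h>

/* GCC 6.3-style: zero a 4-byte temporary in memory, copy 3 bytes into it,
   then reload the whole temporary as one 32-bit value. */
static uint32_t load3_via_temp(const unsigned char *src) {
    uint32_t tmp = 0;
    memcpy(&tmp, src, 3);
    return tmp;
}

/* clang 3.9-style: a zero-extended 2-byte load and a zero-extended 1-byte
   load, combined in registers with a shift and an OR.
   Assumption: little-endian byte order, as on x86. */
static uint32_t load3_in_regs(const unsigned char *src) {
    uint16_t lo;
    memcpy(&lo, src, 2);      /* bytes 0 and 1 */
    uint32_t hi = src[2];     /* byte 2 */
    return (hi << 16) | lo;
}
```

On a little-endian machine both functions return the same value. Neither reads a fourth byte from the source, which is the constraint that forces two loads.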
What GCC is doing with the first six instructions is basically constructing a DWORD at [rsp-4] from the three bytes, as you requested:
mov DWORD PTR [rsp-4], 0 ;DWORD is 0
movzx eax, WORD PTR ds:16 ;EAX = byte 0 and byte 1
mov WORD PTR [rsp-4], ax ;DWORD has byte 0 and byte 1
movzx eax, BYTE PTR ds:18 ;EAX = byte 2
mov BYTE PTR [rsp-2], al ;DWORD has byte 0, byte 1 and byte 2
mov edx, DWORD PTR [rsp-4] ;Same as before from here on
It uses movzx eax, ... (which writes the whole register) instead of a plain load into ax or al, to prevent a partial register stall.
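To see why a load into the low word followed by a load into ah would not assemble the three bytes, here is a small C model of how partial register writes behave. It reads the question's idea as a 16-bit load into ax followed by an 8-bit load into ah. The helper names are hypothetical and only mimic the register semantics.

```c
#include <stdint.h>

/* mov ax, v: writes bits 0..15 and keeps bits 16..31 of the old eax. */
static uint32_t mov_ax(uint32_t eax, uint16_t v) {
    return (eax & 0xFFFF0000u) | v;
}

/* mov ah, v: writes bits 8..15 only, so it overlaps the top half of ax. */
static uint32_t mov_ah(uint32_t eax, uint8_t v) {
    return (eax & 0xFFFF00FFu) | ((uint32_t)v << 8);
}

/* movzx eax, v: writes all 32 bits, independent of the old eax. */
static uint32_t movzx_eax(uint16_t v) {
    return v;
}
```

With source bytes 0x11, 0x22, 0x33 and an old register value of 0xDEADBEEF, `mov_ah(mov_ax(0xDEADBEEF, 0x2211), 0x33)` gives 0xDEAD3311: byte 1 is overwritten, byte 2 never reaches bits 16..23, and the stale upper half survives. `movzx_eax(0x2211)` gives 0x00002211 regardless of what the register held before.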
The compilers did a great job already by eliding the call to memcpy
and as you said the example is "a bit contrived" to follow, even for a human.
The memcpy
optimisations must work for any size, including those that cannot fit a register. It's not easy to get it right every time.
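As a rough illustration of why odd sizes are awkward, an inlined fixed-size copy can be thought of as a sequence of power-of-two moves. The sketch below is not GCC's actual algorithm; `split_copy` and its parameters are hypothetical, and `max_piece` is assumed to be a power of two.

```c
#include <stddef.h>

/* Splits an n-byte copy into descending power-of-two pieces no larger than
   max_piece. Writes the piece sizes to out (at most out_cap entries) and
   returns how many pieces were written. */
static size_t split_copy(size_t n, size_t max_piece, size_t *out, size_t out_cap) {
    size_t count = 0;
    for (size_t piece = max_piece; piece > 0 && n > 0; piece /= 2) {
        while (n >= piece && count < out_cap) {
            out[count++] = piece;
            n -= piece;
        }
    }
    return count;
}
```

For n = 4 this yields a single 4-byte move, while n = 3 yields a 2-byte move followed by a 1-byte move, which matches the WORD and BYTE accesses in the listing above.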
Considering that L1 access latencies have been lowered considerably in recent architectures and that [rsp-4]
is very likely to be in the cache, I'm not sure it's worth messing with the optimisation code in the GCC source.
It is surely worth filing a bug for a missed optimisation and seeing what the developers have to say.