I was reading a textbook, and it has an exercise that asks you to write x86-64 assembly code based on this C code:
//Assume that the values of sp and dp are stored in registers %rdi and %rsi
int *sp;
char *dp;
*dp = (char) *sp;
and the answer is:
//first approach
movl (%rdi), %eax //Read 4 bytes
movb %al, (%rsi) //Store low-order byte
I can understand it, but I'm just wondering: can't we do something simpler in the first place, such as:
//second approach
movb (%rdi), %al //Read only one byte rather than all four bytes
movb %al, (%rsi) //Store low-order byte
Isn't the second approach more concise and straightforward than the first, which seems a little unnecessary, since we only care about the low byte of the value that %rdi points to and aren't really interested in its upper 3 bytes?
Yes, your byte-load way is correct, but it's not actually more efficient on most CPUs.
TL;DR: Generally avoid writing to byte or 16-bit registers when you have equally convenient options that don't do that.
(And BTW, the suggestions you got in comments were both wrong: x86 is little-endian, and store-forwarding problems are very unlikely, although maybe possible on some older CPUs, so that one might not be totally wrong.)
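To make the little-endian point concrete, here is a minimal C sketch of the two approaches. The function names are made up for illustration, and the self-check assumes a little-endian target with a 32-bit int, such as x86-64. On a big-endian machine the byte at the lowest address would be the high byte, and the two functions would disagree.

```c
#include <assert.h>
#include <string.h>

/* First approach: read the whole int, then keep the low byte
   (the shape of movl (%rdi), %eax ; movb %al, (%rsi)). */
void copy_via_dword_load(const int *sp, char *dp) {
    *dp = (char)*sp;
}

/* Second approach: read only the byte at the lowest address
   (the shape of movb (%rdi), %al ; movb %al, (%rsi)).
   This matches the first approach only on a little-endian target. */
void copy_via_byte_load(const int *sp, char *dp) {
    memcpy(dp, sp, 1);
}

/* Hypothetical self-check; assumes a little-endian machine such as x86-64. */
void check_approaches_agree(void) {
    int value = 0x12345678;
    char a = 0, b = 0;
    copy_via_dword_load(&value, &a);
    copy_via_byte_load(&value, &b);
    assert(a == b); /* both see the low byte, 0x78 */
}
```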
Writing a partial register (narrower than 32-bit, so it doesn't implicitly zero-extend into the full register) has a false dependency on the old value on some microarchitectures, e.g. movb (%rdi), %al decodes on Intel Haswell/Skylake as a micro-fused load+merge ALU operation. (See "Why doesn't GCC use partial registers?". A related Q&A covering Intel Haswell/Skylake specifically has a lot of detail.)
It would be more efficient to use movzbl (%rdi), %eax to just do a zero-extending byte load.
Or, since we can assume that the last store to (%rdi) was dword or wider (so store-forwarding will be efficient if it's still in flight), it is actually most efficient to do a dword load with movl (%rdi), %eax. That avoids possible partial-register penalties and has smaller machine-code size than movzbl (smaller is better, as a tie-break between options that are otherwise equal in terms of uops). Also, some old AMD CPUs run movzbl slightly less efficiently than a dword mov load (as if the zero-extending needs an ALU port).
(Most CPUs run movzbl "for free" in a load port; some also run movsbl sign-extension in a load port without needing any ALU port, notably Intel Sandybridge-family.)
Store forwarding is not a problem: all (?) current CPUs can forward efficiently from a dword store to a byte reload of any of the individual bytes, and definitely the low byte, especially when the dword store is aligned (like a C int will be). See https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/
Of course, if you have a use for the char value sign- or zero-extended into a register later, load it that way.
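As a rough illustration of the two extending loads, here is a small C sketch. The function names are hypothetical, and the instruction named in each comment is only what GCC and Clang typically emit for x86-64 at -O2, not something the C language guarantees.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extending byte load; typically movzbl (%rdi), %eax. */
uint32_t load_zero_extended(const uint8_t *p) {
    return *p;
}

/* Sign-extending byte load; typically movsbl (%rdi), %eax. */
int32_t load_sign_extended(const int8_t *p) {
    return *p;
}

/* Hypothetical self-check showing how the two extensions differ. */
void check_extending_loads(void) {
    uint8_t u = 0xF0;
    int8_t s = -16; /* same bit pattern as 0xF0 in two's complement */
    assert(load_zero_extended(&u) == 240u); /* upper 24 bits are zero */
    assert(load_sign_extended(&s) == -16);  /* upper 24 bits copy the sign bit */
}
```

Either way the full 32-bit register gets written, so there is no partial-register merge to worry about.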
Or even better, as @Ira points out, if you're optimizing this code along with something that stored to *sp, you can ideally just use whatever is in the register and optimize away the store/reload. (It's undefined behaviour in C for any other thread to asynchronously change that memory, because it's int *, not volatile or _Atomic int *.)
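A minimal sketch of that idea in C, assuming a hypothetical function that both stores to *sp and then does the copy. Since the read of *sp happens before the write to *dp, using v directly is equivalent even if dp happens to point into *sp, and a compiler is free to make the same transformation itself.

```c
#include <assert.h>

/* Hypothetical helper: behaves like
       *sp = v;
       *dp = (char)*sp;
   but reuses v, which is still in a register, instead of reloading *sp. */
void store_then_copy(int *sp, char *dp, int v) {
    *sp = v;
    *dp = (char)v; /* no load from *sp needed */
}

/* Hypothetical self-check. */
void check_store_then_copy(void) {
    int x = 0;
    char c = 0;
    store_then_copy(&x, &c, 0x1234);
    assert(x == 0x1234);
    assert(c == 0x34); /* low byte of 0x1234 */
}
```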