c assembly x86-64 micro-optimization instructions

Instructions to copy the low byte from an int to a char: Simpler to just do a byte load?


I was reading a textbook that has an exercise asking you to write x86-64 assembly code based on this C code:

//Assume that the values of sp and dp are stored in registers %rdi and %rsi

int *sp;
char *dp;
*dp = (char) *sp;

and the answer is:

//first approach

movl (%rdi), %eax    //Read 4 bytes
movb %al, (%rsi)     //Store low-order byte

I can understand it, but I'm wondering: can't we do something simpler in the first place, such as:

//second approach

movb (%rdi), %al    //Read one byte only rather than all four bytes
movb %al, (%rsi)     //Store low-order byte

Isn't the second approach more concise and straightforward than the first? The first seems a little unnecessary, since we only care about the low byte of the int at (%rdi) and are not interested in its upper 3 bytes.


Solution

  • Yes, your byte-load version is correct, but it's not actually more efficient on most CPUs.
    TL;DR: Generally avoid writing to byte or 16-bit registers when you have equally convenient options that don't do that.

    (And BTW, the suggestions you got in comments were both wrong: x86 is little-endian, so the byte at (%rdi) is the low byte of the int, and store-forwarding problems are very unlikely. They might be possible on some older CPUs, so that comment might not be totally wrong.)
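    As a minimal C sketch of why both versions store the same byte, the two hypothetical helpers below mirror the dword-load and byte-load approaches. The function names are illustrative only, and the equivalence assumes a little-endian target such as x86-64.

```c
#include <string.h>

/* Hypothetical helper mirroring the textbook answer:
   read the whole int, then keep only its low-order byte. */
void copy_via_dword_load(const int *sp, char *dp) {
    *dp = (char)*sp;
}

/* Hypothetical helper mirroring the byte-load version:
   copy only the byte at the lowest address of *sp.
   Assumption: little-endian target, so that byte is the low-order byte. */
void copy_via_byte_load(const int *sp, char *dp) {
    memcpy(dp, sp, 1);
}
```

    On a little-endian machine, calling both helpers on an int holding 0x12345678 stores 0x78 through dp in each case.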


    Writing a partial register (narrower than 32-bit, so it doesn't implicitly zero-extend into the full register) has a false dependency on the old value on some microarchitectures. For example, movb (%rdi), %al decodes on Intel Haswell/Skylake as a micro-fused load+merge ALU operation. (See the Q&A "Why doesn't GCC use partial registers?"; a separate linked answer covers Intel Haswell/Skylake specifically in a lot of detail.)

    It would be more efficient to use movzbl (%rdi), %eax, which does a zero-extending byte load and writes the full register.

    Or, since we can assume that the last store to (%rdi) was dword or wider (so store-forwarding will be efficient if it's still in flight), it is actually most efficient to do a dword load with movl (%rdi), %eax. That avoids possible partial-register penalties and has smaller machine-code size than movzbl (2 bytes vs. 3; smaller is better as a tie-break between options that are otherwise equal in uops). Also, some old AMD CPUs run movzbl slightly less efficiently than a dword mov load (as if the zero-extension needed an ALU port).

    (Most CPUs run movzbl "for free" in a load port. Some, notably Intel Sandybridge-family, also run movsbl sign-extension in a load port without needing any ALU port.)


    Store forwarding is not a problem: all (?) current CPUs can forward efficiently from a dword store to a byte reload of any of the individual bytes, and definitely the low byte, especially when the dword store is aligned (like a C int will be). See https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/

    Of course, if you have a use for the char value sign- or zero-extended into a register later, load it that way.
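    In C terms, the choice between the two extending loads can be sketched as follows. The helper names are hypothetical, and the results assume a little-endian target. Compilers such as GCC and Clang typically turn these into a single movzbl or movsbl load, though the exact output depends on compiler and options.

```c
/* Hypothetical helper: low byte of *sp, zero-extended to int.
   Typically compiles to a movzbl load on x86-64. */
int load_low_byte_zero_extended(const int *sp) {
    return *(const unsigned char *)sp;
}

/* Hypothetical helper: low byte of *sp, sign-extended to int.
   Typically compiles to a movsbl load on x86-64. */
int load_low_byte_sign_extended(const int *sp) {
    return *(const signed char *)sp;
}
```

    For an int holding 0x123456F0, the first helper returns 240 and the second returns -16, because the low byte 0xF0 has its sign bit set.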

    Or even better, as @Ira points out, if you're optimizing this code along with something that stored to *sp, you can ideally just use whatever is in the register and optimize away the store/reload. (It's undefined behaviour in C for any other thread to asynchronously change that memory because it's int *, not volatile or _Atomic int*.)
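
    A minimal sketch of that last case, using a hypothetical function that both stores to *sp and copies its low byte: because no other store can intervene between the two statements, a compiler is free to reuse the register holding value instead of reloading from memory.

```c
/* Hypothetical example: the store to *sp and the byte copy are in one function. */
void store_then_copy_low_byte(int *sp, char *dp, int value) {
    *sp = value;       /* the earlier dword store */
    *dp = (char)*sp;   /* may be compiled as a byte store of `value`, with no reload */
}
```

    With optimization enabled, the generated code can then be just two stores (a dword store and a byte store from the same register), with no load at all.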