Tags: assembly, x86, sse, att, micro-optimization

Why move 32-bit register to stack then from stack to xmm register?


I am compiling with gcc -m32 on a 64-bit machine.

What is the difference between the following two sequences? Note that this is AT&T syntax.

# this
movd  %edx, %xmm0         # ALU transfer: integer register -> xmm register

# and this
movl  %edx, (%esp)        # store the integer register to the stack
movd  (%esp), %xmm0       # then reload it into the xmm register

Solution

  • The only difference in machine state is that the 2nd version leaves a copy on the stack (see footnote 1).

    GCC's default tuning bounces the value through memory for some reason (recent GCC may have fixed this for some cases). It's generally worse on most CPUs most of the time, including AMD, even though AMD's optimization manual did recommend it. See GCC bugs 80820 and 80833 regarding GCC's integer <-> xmm strategies in general.
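
    To watch GCC make this choice, a small function along the following lines (a hypothetical reproduction, not code from the question) is enough: the sum a + b is computed in an integer register, so the compiler has to transfer it into an xmm register one way or the other. Which sequence you get depends on the GCC version, optimization level, and -mtune setting.

    #include <emmintrin.h>

    /* Hypothetical repro: build with something like
     *   gcc -m32 -msse2 -O2 -S repro.c
     * The transfer of (a + b) into %xmm0 comes out either as a direct
     * movd %reg, %xmm0 or as a store to the stack plus a movd reload,
     * depending on GCC version and tuning. */
    __m128i int_to_vec(int a, int b)
    {
        return _mm_cvtsi32_si128(a + b);
    }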

    Using movd costs 1 ALU uop, vs. a store uop and a load uop for the store/reload version. That's fewer uops for the front-end but different uops for the back-end, so depending on the surrounding code the store/reload strategy could reduce pressure on a specific execution port.

    Latency is better for ALU movd than for store/reload on all CPUs, so the only advantage to store/reload is possible throughput.

    Agner Fog says in his microarch pdf for Bulldozer (the CPU with the slowest movd %edx, %xmm0):

    The transport delays between the integer unit and the floating point/vector unit are much longer in my measurements than specified in AMD's Software Optimization Guide. Nevertheless, I cannot confirm that it is faster to move data from a general purpose register to a vector register through a memory intermediate, as recommended in that guide.


    Footnote 1: if you really do want a copy left on the stack, an ALU movd plus a separate store is usually still the better way to get that machine state: same number of uops, and lower latency, especially on Intel CPUs (movd (x)mm, r32/r64 is 10 / 5 cycles on AMD Bulldozer / Steamroller, vs. 1 cycle on Intel).

    movd %edx, %xmm0         # ALU int -> xmm transfer
    movl %edx, (%esp)        # and store a copy if you want it
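
    At the C level, the footnote's "movd plus a separate store" corresponds to writing the store and the vector use independently and letting the compiler pick the instructions. A hypothetical sketch (function and parameter names are made up here):

    #include <emmintrin.h>

    /* Hypothetical sketch: keep a copy of the value in memory *and* get
     * it into an xmm register.  A compiler following the footnote's
     * advice can implement this as one ALU movd plus one independent
     * store, rather than a store followed by a reload. */
    __m128i keep_copy_and_widen(int a, int b, int *copy)
    {
        int v = a + b;                 /* v lives in an integer register */
        *copy = v;                     /* independent store of the copy  */
        return _mm_cvtsi32_si128(v);   /* int -> xmm transfer            */
    }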