assembly optimization x86-64 micro-optimization

In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?

Is there a good way of optimising this x86-64 code?

mov dword ptr [rsp], 0
mov dword ptr [rsp+4], 0

where the immediate values could be any values, not necessarily zero, but in this instance always immediate constants.

Is the original pair of stores even slow? Write-combining in the hardware and parallel operation of the μops might just make everything ridiculously fast anyway? I’m wondering if there is no problem to fix.

I’m thinking of something like (don’t know if the following instructions even exist)

mov  qword ptr[rsp], 0

mov  eax, 0
mov  qword ptr [rsp], rax    ; assuming we can spare a register, a bad idea to corrupt one though

Solution

Yes, you can combine your two 32-bit writes into a single 64-bit write, like so:

mov     QWORD PTR [rsp], 0

The immediate is only 32 bits wide and is sign-extended to 64 bits, so it's not this simple if your second write is non-zero¹, or if the MSB of the first write is 1. In that case, you can load the full 64-bit constant into a register with movabs (the GNU assembler's name for mov with a 64-bit immediate) and then store that register. E.g., to write 1 and 2,

movabs  rax,  0x200000001
mov     QWORD PTR [rsp], rax

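The constant 0x200000001 results in the right values being written into each 32-bit half: x86 is little-endian, so the low 32 bits (1) land at [rsp] and the high 32 bits (2) land at [rsp+4].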
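If it helps, the constant can be derived mechanically rather than by hand. Below is a minimal sketch in C; `combine_halves` and `dword_after_store` are hypothetical helper names, not anything from the question, and the read-back check assumes a little-endian target such as x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the 64-bit constant for a single qword store.
   `first` is the dword destined for [rsp], `second` the one for [rsp+4]. */
uint64_t combine_halves(uint32_t first, uint32_t second)
{
    return ((uint64_t)second << 32) | first;
}

/* Simulate the single 8-byte store, then read back dword `index` (0 or 1).
   Assumes a little-endian machine, as x86-64 is. */
uint32_t dword_after_store(uint64_t combined, int index)
{
    unsigned char buf[8];
    uint32_t out;

    assert(index == 0 || index == 1);
    memcpy(buf, &combined, sizeof combined);   /* the one 64-bit store */
    memcpy(&out, buf + 4 * index, sizeof out); /* the 32-bit view of it */
    return out;
}
```

For the values in the example, `combine_halves(1, 2)` gives 0x200000001, and reading the buffer back yields 1 at offset 0 and 2 at offset 4.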

This trick is definitely worth it for the zero case and maybe worth it for the non-zero case; Peter's answer goes into much more detail on the tradeoffs in that latter case.

Compilers can also make this optimization (GCC calls it "store merging"), meaning you can play with this on godbolt.
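As a starting point for that kind of experiment, here is a small C sketch you could paste into a compiler explorer. The struct and function names are made up for illustration, and whether the two stores actually get merged into one depends on the compiler, its version, and the optimization flags, so treat the generated asm as something to inspect rather than assume.

```c
#include <stdint.h>

/* Two adjacent 32-bit fields, mirroring [rsp] and [rsp+4] in the question. */
struct pair {
    uint32_t lo;
    uint32_t hi;
};

/* Zero case: a candidate for a single qword store of a zero immediate. */
void store_zeros(struct pair *p)
{
    p->lo = 0;
    p->hi = 0;
}

/* Non-zero case: a compiler may or may not choose a 64-bit constant
   loaded into a register followed by one store. */
void store_one_two(struct pair *p)
{
    p->lo = 1;
    p->hi = 2;
}
```

Comparing the output at different optimization levels shows which form the compiler prefers for each case.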

¹ Except in the special case that the sign extension gives you exactly what you want. I.e., the second value is exactly 0xFFFFFFFF and the high bit of the first value is set.
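The condition in this footnote, together with the simple case where the upper half is zero, can be written down as a small check. This is only a sketch in C; `fits_sign_extended_imm32` is a hypothetical name, and it models the sign extension arithmetically rather than executing the instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a single `mov qword ptr [mem], imm32`
   would store `first` at offset 0 and `second` at offset 4. */
bool fits_sign_extended_imm32(uint32_t first, uint32_t second)
{
    /* The 64-bit value we want in memory (little-endian layout). */
    uint64_t wanted = ((uint64_t)second << 32) | first;

    /* What would actually be stored: the 32-bit immediate with its
       top bit copied into the upper 32 bits. */
    uint64_t stored = first;
    if (first & 0x80000000u)
        stored |= 0xFFFFFFFF00000000ULL;

    return wanted == stored;
}

/* A few cases from the discussion above. */
void imm32_examples(void)
{
    assert(fits_sign_extended_imm32(0, 0));                     /* zero case */
    assert(!fits_sign_extended_imm32(1, 2));                    /* needs movabs */
    assert(fits_sign_extended_imm32(0x80000000u, 0xFFFFFFFFu)); /* footnote case */
}
```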