Is there a good way to optimise this x86-64 code?
mov dword ptr [rsp], 0
mov dword ptr [rsp+4], 0
Here the stored values are not necessarily zero, but in this instance they are always immediate constants.
Is the original pair of stores even slow? Write-combining in the hardware and parallel operation of the μops might just make everything ridiculously fast anyway? I’m wondering if there is no problem to fix.
I’m thinking of something like (don’t know if the following instructions even exist)
mov qword ptr[rsp], 0
or
mov eax, 0;
mov qword ptr[rsp], rax ; assuming we can spare a register, a bad idea to corrupt one though
Yes, you can combine your two 32-bit writes into a single 64-bit write, like so:
mov QWORD PTR [rsp], 0
The immediate is 32 bits wide and is sign-extended to 64 bits, so it's not this simple if your second write is non-zero [1], or if the MSB of the first write is 1. In that case, you can load a 64-bit constant into a register using movabs and store that. E.g., to write 1 and 2:
movabs rax, 0x200000001
mov QWORD PTR [rsp], rax
The constant 0x200000001 results in the right values being written into each 32-bit half: x86 is little-endian, so the low dword (1) lands at [rsp] and the high dword (2) lands at [rsp+4].
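To make the construction of that constant concrete, here is a minimal C sketch. The helper names combine_dwords and store_pair are hypothetical (they are not from the question), and the memory layout it relies on assumes a little-endian target such as x86-64.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the 64-bit value that a single qword store
   would write. `lo` is the dword intended for [addr], `hi` the dword
   intended for [addr+4]. */
uint64_t combine_dwords(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: write both dwords with one 8-byte copy.
   Assumes a little-endian machine, so the low half lands first in memory. */
void store_pair(void *dst, uint32_t lo, uint32_t hi) {
    uint64_t q = combine_dwords(lo, hi);
    memcpy(dst, &q, sizeof q);
}
```

Calling combine_dwords(1, 2) yields 0x200000001, the same constant used with movabs above.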
This trick is definitely worth it for the zero case and may be worth it for the non-zero case; Peter's answer goes into much more detail on the tradeoffs in the latter case.
Compilers can also make this optimization (they call it "store combining" or something like that), meaning you can play with this on godbolt.
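If you want something to paste into godbolt, a small C test case along these lines should do. The struct and function names are made up for illustration, and whether the two stores actually get merged depends on the compiler, its version, and the optimization flags, so treat the generated assembly as something to inspect rather than assume.

```c
#include <stdint.h>

/* Two adjacent 32-bit fields, mirroring [rsp] and [rsp+4]. */
struct pair {
    uint32_t a;
    uint32_t b;
};

/* Zero case: a candidate for a single qword store of immediate 0. */
void store_zeros(struct pair *p) {
    p->a = 0;
    p->b = 0;
}

/* Non-zero case: a candidate for a 64-bit constant load plus one qword store. */
void store_one_two(struct pair *p) {
    p->a = 1;
    p->b = 2;
}
```

Compile with optimizations enabled (for example -O2) and compare the output for the two functions.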
[1] Except in the special case where the sign extension gives you exactly what you want, i.e., the second value is exactly 0xFFFFFFFF and the high bit of the first value is set.
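The rule for when the single mov QWORD PTR [rsp], imm32 form is enough can be written as a small check. This is only a sketch of the condition described above; the function name fits_sign_extended_imm32 is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when storing `lo` as a sign-extended 32-bit immediate in a
   qword store also produces `hi` in the upper dword. Sign extension fills
   the upper half with copies of bit 31 of `lo`. */
bool fits_sign_extended_imm32(uint32_t lo, uint32_t hi) {
    uint32_t expected_hi = (lo & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    return hi == expected_hi;
}
```

For the pair (0, 0) this returns true, for (1, 2) it returns false, and for (0x80000000, 0xFFFFFFFF) it returns true, which is the footnote's special case.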