assembly x86-64 micro-optimization instruction-encoding

Efficiently loading both RAX and R8 with the same small positive number

Instead of writing mov rax, 1 (7-byte encoding 48, C7, C0, 01, 00, 00, 00), I can write mov eax, 1 (5-byte encoding B8, 01, 00, 00, 00), relying on the fact that writing a 32-bit register automatically zeroes the high dword of the full 64-bit register.

For copying RAX to R8, I can choose between mov r8, rax (3-byte encoding 49, 89, C0) and mov r8d, eax (3-byte encoding 41, 89, C0), the latter again relying on the automatic zeroing of the high dword.

Is there any reason at all to prefer one method of copying over the other?
The REX prefix cannot be avoided since R8 is one of the 'new' registers, and so REX.B is needed. In that case, is it desirable to try to avoid having the REX.W bit set?

Solution

If you need a REX prefix anyway, it doesn't really matter which bits are set in it. Almost all 64-bit instructions are as fast as their 32-bit counterparts; the exceptions are the usual suspects (multiplications and divisions).
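To see why the prefix bits do not change the code size, here is a minimal sketch of how the two copies are encoded. It is written in Python only so the bytes can be checked mechanically; the helper names (rex, modrm, mov_rm_r) are made up for this illustration and it handles only the register-to-register form of opcode 89 /r.

```python
def rex(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte, laid out as 0100WRXB."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

def modrm(mod, reg, rm):
    """Build a ModRM byte; only the low 3 bits of reg and rm fit here."""
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)

def mov_rm_r(dst, src, wide):
    """Encode MOV r/m, r (opcode 89 /r) for register numbers 0..15.

    dst goes into ModRM.rm (bit 3 lives in REX.B),
    src goes into ModRM.reg (bit 3 lives in REX.R).
    """
    prefix = rex(w=int(wide), r=src >> 3, b=dst >> 3)
    body = bytes([0x89, modrm(0b11, src, dst)])
    # For 32/64-bit operands a bare 0x40 prefix adds nothing and is omitted.
    return body if prefix == 0x40 else bytes([prefix]) + body

RAX, R8 = 0, 8  # hardware register numbers

mov_r8_rax = mov_rm_r(R8, RAX, wide=True)    # mov r8, rax
mov_r8d_eax = mov_rm_r(R8, RAX, wide=False)  # mov r8d, eax

print(mov_r8_rax.hex(" "), "|", mov_r8d_eax.hex(" "))
```

Both results are three bytes long: REX.W and REX.B are just two bits of the same prefix byte, so clearing REX.W saves nothing once REX.B is required.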

As for which of these two is faster: despite having a longer dependency chain, the second variant

mov eax, 1
mov r8d, eax

is likely to be faster, as the second instruction is likely to be handled by register renaming (mov elimination), so it has zero latency and needs no execution unit, although it still passes through the front end like any other instruction. There are somewhat obscure exceptions in which register renaming may not fire; use a microarchitectural analyser to find these. In such cases, it may be better to load two immediates, as these can be executed in parallel.
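For the code-size side of that trade-off, the following sketch compares the copy variant with the two-immediates variant (mov eax, 1 followed by mov r8d, 1). As an assumption for illustration it uses the shortest standard encodings, B8+rd for the immediate load and 89 /r for the copy; the helper names are hypothetical, and it says nothing about timing, which still has to be measured or analysed per microarchitecture.

```python
import struct

def mov_r32_imm32(reg, imm):
    """Encode MOV r32, imm32 (opcode B8+rd); registers 8..15 need REX.B (0x41)."""
    prefix = b"\x41" if reg >= 8 else b""
    return prefix + bytes([0xB8 + (reg & 7)]) + struct.pack("<I", imm)

def mov_r32_r32(dst, src):
    """Encode MOV r/m32, r32 (opcode 89 /r), register-direct form."""
    rex = 0x40 | ((src >> 3) << 2) | (dst >> 3)   # REX.R from src, REX.B from dst
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)   # mod=11: both operands are registers
    return (bytes([rex]) if rex != 0x40 else b"") + bytes([0x89, modrm])

EAX, R8D = 0, 8  # hardware register numbers

# mov eax, 1 / mov r8d, eax
copy_variant = mov_r32_imm32(EAX, 1) + mov_r32_r32(R8D, EAX)
# mov eax, 1 / mov r8d, 1
two_imm_variant = mov_r32_imm32(EAX, 1) + mov_r32_imm32(R8D, 1)

print(len(copy_variant), copy_variant.hex(" "))
print(len(two_imm_variant), two_imm_variant.hex(" "))
```

The copy variant comes to 8 bytes and the two-immediates variant to 11, so the independent loads cost three extra bytes of code in exchange for removing the dependency on EAX.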