x86 legacy instruction-set amd-processor mmx

How did the legacy 3DNow! instruction set store results to memory or integer registers?

Just for fun I'm reviewing legacy (deprecated) instructions from the 3DNow! set introduced by AMD, and I'm trying to understand how they were used. All instructions seem to be encoded following this pattern:

instruction destination_MMn_register_operand, source_MMn_register_or_memory_operand

where destinationRegister = destinationRegister -operation- source

For instance, pfadd mm0, mmword ptr [rcx] (0F 0F 01 9E) would add the 2 packed floats in memory pointed to by rcx to the 2 packed floats stored in mm0 and keep the result in mm0.

So it seems like those 3DNow! instructions always have an mm register as the destination.

But how were you supposed to get the results out of those mm registers?

In other words, there are no mov mmword ptr [rcx], mm0 or mov rax, mm0 instructions.

Solution

As @harold says, storing to memory or extracting to an integer register is already covered by MMX movq (both) or movd (low), or punpckhdq+movd to extract just the high float. (Or with MMXEXT introduced with SSE1, pshufw to copy-and-shuffle into another register, not destroying the original.) Similarly for loading.

 PF2ID  mm0, [esi]     ; 3DNow! load 2 floats and convert to 32-bit integer
; basic MMX instructions to use the result
; could do the same thing with 32-bit FP bit patterns
 movq  [edi], mm0      ; store both
 movd  eax, mm0        ; extract low half
 punpckhdq  mm0, mm0   ; broadcast high half
 movd  edx, mm0        ; extract high half

I used 32-bit addressing modes so this code can work in 32-bit mode for compat with CPUs before K8. In 64-bit mode you have SSE2, which makes 3DNow! mostly pointless, except for working with exactly 2 floats at a time on CPUs like K8 where 128-bit SIMD instructions like addps run as 2 uops, or if you had some existing code developed for 3DNow! and hadn't ported it to SSE2 yet. 64-bit mode does have movq rax, mm0, just like movq rax, xmm0.
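To make the lane handling in that asm sequence concrete, here is a scalar C model of it. This is only a sketch: it treats an mm register as a plain uint64_t, assumes float is a 32-bit IEEE 754 type, and the helper names (pack2f, model_movd, model_punpckhdq_self, bits_to_float) are made up for illustration. It is not MMX code and doesn't use any intrinsics.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model only: an mm register is represented as a uint64_t.
   All helper names here are hypothetical, chosen for this sketch. */

/* Pack two floats the way a 3DNow! register holds them: element 0 in the low half. */
static uint64_t pack2f(float lo, float hi) {
    uint32_t l, h;
    memcpy(&l, &lo, sizeof l);   /* type-pun via memcpy, no aliasing UB */
    memcpy(&h, &hi, sizeof h);
    return ((uint64_t)h << 32) | l;
}

/* Models movd eax, mm0: take the low 32 bits. */
static uint32_t model_movd(uint64_t mm) {
    return (uint32_t)mm;
}

/* Models punpckhdq mm0, mm0: the high dword ends up in both halves. */
static uint64_t model_punpckhdq_self(uint64_t mm) {
    uint64_t hi = mm >> 32;
    return (hi << 32) | hi;
}

/* Reinterpret a 32-bit pattern as a float. */
static float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With mm = pack2f(1.5f, -2.0f), model_movd(mm) gives the bit pattern of 1.5f, and model_movd(model_punpckhdq_self(mm)) gives the bit pattern of -2.0f, matching the movd / punpckhdq / movd steps. The 64-bit movq rax, mm0 corresponds to just using the whole uint64_t.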

The one thing you can't do is turn a 3DNow! float into an x87 80-bit float without a store/reload.

What might have been potentially useful is a version of EMMS that expands a 32-bit float into an 80-bit x87 long double in st0, along with setting the FPU back into x87 mode instead of MMX mode¹. Or maybe even do that for multiple mm registers into multiple x87 registers?

i.e. it would be a shortcut for movd dword [esp], mm0 / emms / fld dword [esp] to set up for further scalar FP after a SIMD reduction.

Remember that these are IEEE754 floats; you normally don't want them in integer registers unless you're picking apart their bit-fields (e.g. for an exp or log implementation), but you can do that with MMX shift/mask instructions.
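As an illustration of what "picking apart their bit-fields" means, here is a minimal C sketch that splits a binary32 float into sign, exponent and mantissa using integer shifts and masks, the same kind of operations the MMX shift/mask instructions would do on both lanes at once. The struct and function names are hypothetical, and it assumes float is a 32-bit IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

/* Hypothetical names for this sketch. */
typedef struct {
    uint32_t sign;             /* 1 bit */
    uint32_t biased_exponent;  /* 8 bits, bias 127 */
    uint32_t mantissa;         /* 23 explicit fraction bits */
} float_fields;

static float_fields split_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* get the raw bit pattern */

    float_fields r;
    r.sign            = bits >> 31;
    r.biased_exponent = (bits >> 23) & 0xFFu;
    r.mantissa        = bits & 0x7FFFFFu;
    return r;
}

/* For normal (not zero/denormal/inf/NaN) values, this is floor(log2(|f|)),
   which is the usual starting point for a log implementation. */
static int unbiased_exponent(float f) {
    return (int)split_float(f).biased_exponent - 127;
}
```

For example, -6.5f is -1.625 * 2^2, so split_float(-6.5f) has sign 1, biased exponent 129 and mantissa 0x500000.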

PF2ID or PF2IW to convert to 32-bit or 16-bit integer of course give you integer data in MMX registers, at which point you're in normal MMX territory.

But movd and fld are cheap, so they didn't bother making a special instruction just to save the reload latency. Also, it might have been slow to implement as a single instruction. Even though x86 is not a RISC ISA, having one really complex instruction is often slower than multiple simpler instructions, especially before decoding to multiple uops was fully a thing. Look at in-order P5 Pentium for an example of how using a RISCy subset of x86 was more efficient there, allowing it to pipeline and pair better if you avoid instructions like push/pop. (That's all changed; push/pop and memory-destination ALU instructions are fine if you need the load/store anyway, and don't have a use for the value in a register.)

3DNow!'s femms leaves the MMX/3DNow! register contents undefined, only setting the tag words to unused instead of preserving the mapping from MMX registers to/from x87 register contents. See http://refspecs.linuxbase.org/AMD-3Dnow.pdf for an official AMD manual. IDK if AMD's microarchitectures just dropped the register-renaming info or what, but probably making store / femms / x87-load the fast way saves a lot of transistors.

Or maybe even FEMMS is still somewhat slow, so they didn't want to encourage coders to leave/re-enter MMX/3DNow! mode at all often.

Fun fact: 3DNow! PREFETCHW (prefetch with write intent) is still used, and has its own CPUID feature bit.

See my answer on What is the effect of second argument in _builtin_prefetch()?

Intel CPUs soon added support for decoding it as a NOP (so software like 64-bit Windows can use it without checking), but Broadwell and later actually prefetch with an RFO to get the cache line in MESI Exclusive state, rather than Shared, so it can flip to Modified without additional off-core traffic.

The CPUID feature bit indicates that it really will prefetch.
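From C, the usual way to ask for a write-intent prefetch is the second argument of GCC/Clang's __builtin_prefetch. Here is a minimal sketch: the function name, the 16-element prefetch distance and the 64-byte line size are assumptions for illustration, not tuned values. Whether the compiler actually emits prefetchw depends on target options (e.g. GCC's -mprfchw, or a -march= that implies it); otherwise you may get an ordinary prefetch instead.

```c
#include <stddef.h>

/* Assumed: 16 ints = 64 bytes = one cache line. */
enum { AHEAD = 16 };

/* Hypothetical example: increment every element, hinting that the
   line one step ahead is about to be written. */
static void increment_all(int *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Once per (assumed) line, and only while the address is still inside
           the array: the prefetch itself never faults, but the address
           expression has to be valid C. */
        if (i % AHEAD == 0 && i + AHEAD < n)
            __builtin_prefetch(&a[i + AHEAD], 1 /* rw: write */, 3 /* high locality */);
        a[i] += 1;
    }
}
```

The prefetch is only a hint, so the results are the same with or without it. For a simple sequential loop like this the hardware prefetchers would normally do the job anyway; it's just the smallest loop that shows the rw=1 argument.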

Footnote 1:

Remember that the MMX registers alias the x87 registers, so no new OS support was needed to save/restore architectural state on context switches. It wasn't until SSE that we got new architectural state. So it wasn't until SSE2+3DNow! that a 3DNow! float to SSE2 double could make sense without switching back to x87 mode. With both, you could movq2dq xmm0, mm0 + cvtps2pd xmm0, xmm0.

They could have had a float->double in an mm register, but the fld / fst hardware was only designed for float or double->80-bit and 80-bit->float or double. And the use-case for that is limited; if you're using 3DNow!, just stick to float.