Search code examples
c++visual-c++avxfma

How to get data out of AVX registers?


Using MSVC 2013 and AVX 1, I've got 8 floats in a register:

__m256 foo = mm256_fmadd_ps(a,b,c);

Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX intrisics would make this rather complicated:

print(_castu32_f32(_mm256_extract_epi32(foo, 0)));
print(_castu32_f32(_mm256_extract_epi32(foo, 1)));
print(_castu32_f32(_mm256_extract_epi32(foo, 2)));
// ...

but MSVC doesn't even have either of these two intrinsics. Sure, I could write back the values to memory and load from there, but I suspect that at assembly level there's no need to spill a register.

Bonus Q: I'd of course like to write

for(int i = 0; i !=8; ++i) 
    print(_castu32_f32(_mm256_extract_epi32(foo, i)))

but MSVC doesn't understand that many intrinsics require loop unrolling. How do I write a loop over the 8x32 floats in __m256 foo?


Solution

  • Careful: _mm256_fmadd_ps isn't part of AVX1. FMA3 has its own feature bit, and was only introduced on Intel with Haswell. AMD introduced FMA3 with Piledriver (AVX1+FMA4+FMA3, no AVX2).


    At the asm level, if you want to get eight 32bit elements into integer registers, it is actually faster to store to the stack and then do scalar loads. pextrd is a 2-uop instruction on SnB-family, and Bulldozer-family. (and Nehalem and Silvermont, which don't support AVX).

    The only CPU where vextractf128 + 2xmovd + 6xpextrd isn't terrible is AMD Jaguar. (cheap pextrd, and only one load port.) (See Agner Fog's insn tables)

    A wide aligned store can forward to overlapping narrow loads. (Of course, you can use movd to get the low element, so you have a mix of load port and ALU port uops).


    Of course, you seem to be extracting floats by using an integer extract and then converting it back to a float. That seems horrible.

    What you actually need is each float in the low element of its own xmm register. vextractf128 is obviously the way to start, bringing element 4 to the bottom of a new xmm reg. Then 6x AVX shufps can easily get the other three elements of each half. (Or movshdup and movhlps have shorter encodings: no immediate byte).

    7 shuffle uops are worth considering vs. 1 store and 7 load uops, but not if you were going to spill the vector for a function call anyway.


    ABI considerations:

    You're on Windows, where xmm6-15 are call-preserved (only the low128; the upper halves of ymm6-15 are call-clobbered). This is yet another reason to start with vextractf128.

    In the SysV ABI, all the xmm / ymm / zmm registers are call-clobbered, so every print() function requires a spill/reload. The only sane thing to do there is store to memory and call print with the original vector (i.e. print the low element, because it will ignore the rest of the register). Then movss xmm0, [rsp+4] and call print on the 2nd element, etc.

    It does you no good to get all 8 floats nicely unpacked into 8 vector regs, because they'd all have to be spilled separately anyway before the first function call!