
AVX 32-bit integer to double precision float best practice


I have an array of 32-bit integers that I want to convert to doubles, and hope to use _mm256_cvtepi32_pd() to perform the conversion.

My issue is that this intrinsic converts only 4 of the 8 integers in the register into doubles.

The source array is actually an array of 64-bit structs coming from an FPGA, where my integer value is only 24 bits at the top of the first 32 bits of each struct. I'm using _mm256_i32gather_epi32() and _mm256_srli_epi32() to extract and shift the data into a proper 8-element 32-bit integer YMM register, which I also want to store (using _mm256_store_si256()) and keep a copy of.

I would like to know what the "best practice" to convert all 8 elements in the YMM register into doubles is.

Is the best plan to take the YMM, convert to double (using _mm256_cvtepi32_pd()), store with _mm256_store_si256(), then logically right-shift the YMM by 64 bits and repeat the conversion on the other half? Or is there some better approach?
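
Something like this, roughly (a simplified sketch: the buffer names, the aligned output buffers, and which dword of each 8-byte record gets gathered are just placeholders):

#include <immintrin.h>
#include <stdint.h>

// Gather one 32-bit chunk per 8-byte record, shift to keep the top 24 bits,
// keep an integer copy, and convert (so far only the low 4 elements) to double.
void convert8(const int32_t *records, int32_t *int_out, double *dbl_out)
{
    // dword indices 0,2,4,... with scale 4: the first dword of each record
    __m256i idx = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14);
    __m256i v   = _mm256_i32gather_epi32(records, idx, 4);
    v = _mm256_srli_epi32(v, 8);                 // top 24 bits, zero-extended
    _mm256_store_si256((__m256i*)int_out, v);    // keep the integer copy

    __m256d lo = _mm256_cvtepi32_pd(_mm256_castsi256_si128(v));  // elements 0..3 only
    _mm256_store_pd(dbl_out, lo);
    // ... what's the best way to also convert elements 4..7?
}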


Solution

  • If possible with equal efficiency, only make __m128i vectors of 4 int32_t elements in the first place, so you can use them as inputs to _mm256_cvtepi32_pd. (With manual gathers, this saves work. With hardware vpgatherdd gathers, one 8-element gather is cheaper than two 4-element vpgatherdd xmm, so it depends on the hardware whether that saving pays for the extra vextracti128 shuffle you then need.)

    Given a __m256i, the low half is of course free: _mm256_castsi256_si128(v) works directly as an input to _mm256_cvtepi32_pd(). Most compilers will also optimize _mm256_extracti128_si256(v, 0) to zero instructions, just using the XMM low half of the YMM vector register.

    You're correct, the high half is the problem. You need to get the high 128 bits (not 64) to the bottom of a vector, i.e. as a __m128i. One option is an ALU shuffle uop, which AMD runs very efficiently and Intel treats like any other lane-crossing shuffle:

    __m128i high = _mm256_extracti128_si256(v, 1);    //  vextracti128 xmm, ymm, 1
    
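    Putting both halves together, a register-only conversion of all 8 elements could look like this minimal sketch (the helper name and output-pointer signature are arbitrary):

    #include <immintrin.h>

    // low half: free cast; high half: one vextracti128, then vcvtdq2pd on each
    static inline void cvt8_i32_to_f64(__m256i v, __m256d *lo, __m256d *hi)
    {
        *lo = _mm256_cvtepi32_pd(_mm256_castsi256_si128(v));       // elements 0..3
        *hi = _mm256_cvtepi32_pd(_mm256_extracti128_si256(v, 1));  // elements 4..7
    }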

    The other option is store/reload: store the whole 256 bits with _mm256_store_si256, or just the high half with _mm_store_si128(ptr, _mm256_extracti128_si256(v, 1)) (which compilers will hopefully optimize to vextracti128 mem, ymm, 1).

    Store/reload is interesting because memory-source vcvtdq2pd ymm, mem runs as a single micro-fused uop on Intel CPUs, using the load port's shuffle/broadcast capability instead of a port-5 shuffle uop as in vcvtdq2pd ymm, xmm (2 uops on Intel and AMD). https://uops.info/

    Presumably the actual int<->FP conversion hardware wants the input data within the same 64-bit element as the output. That's why same-width conversions such as 32<->32-bit (dq2ps) and AVX-512 64<->64-bit (qq2pd) are single-uop, while conversions between different element sizes have an extra port-5 uop that's almost certainly a shuffle: those instructions read or write the low 128 bits of a vector, rather than the odd or even elements of a same-width source.

    So vcvtdq2pd ymm, mem is a single micro-fused uop on Intel, although still 2 on Zen 4.
    But feeding it costs us an extra store uop, so it's break-even for total front-end uop throughput on Intel; we're trading pressure on port 5 for extra load + store-data + store-address uops. Assuming you use a 256-bit store rather than vextracti128 mem, ymm, 1, the store-address and store-data uops can micro-fuse, and the extra load micro-fuses with the conversion. vextracti128 mem, ymm, 1 can't micro-fuse: presumably the immediate takes up space in the internal uop, not leaving room to pack the store-address and store-data uops into the same ROB (reorder buffer) entry, and maybe not even into one uop-cache entry. (On Skylake and earlier, store-address uops compete with loads for ports 2 and 3; only port 7 is store-only, and [rsp + disp8] is a simple addressing mode that can be allocated to port 7.)

    So Intel's design of using an immediate, to make AVX potentially extensible to wider vectors, didn't pay off at all; AVX-512 turned out to be a whole new thing with different encodings, so the immediate (vs. an implicit lane selector of 1) just cost them efficiency for 256-bit insert/extract instructions.

    If you're doing gathers, you might be close to bottlenecking on load-port throughput, but probably not with other work going on, even on Skylake without Downfall microcode updates. Especially with any cache misses.
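
    A store/reload version might look like the sketch below. Whether a given compiler actually keeps the reloads folded into memory-source vcvtdq2pd (rather than doing separate vmovdqa loads, or optimizing the store/reload back into shuffles) is up to the compiler, so check the asm; the buffer here is just a local scratch array.

    #include <immintrin.h>
    #include <stdalign.h>
    #include <stdint.h>

    // Store the whole vector once, then convert each 128-bit half from memory.
    static inline void cvt8_i32_to_f64_mem(__m256i v, __m256d *lo, __m256d *hi)
    {
        alignas(32) int32_t tmp[8];
        _mm256_store_si256((__m256i*)tmp, v);
        *lo = _mm256_cvtepi32_pd(_mm_load_si128((const __m128i*)tmp));        // ideally vcvtdq2pd ymm, [tmp]
        *hi = _mm256_cvtepi32_pd(_mm_load_si128((const __m128i*)(tmp + 4)));  // ideally vcvtdq2pd ymm, [tmp+16]
    }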


    AVX2 hardware gather efficiency vs. doing it manually

    How far apart are the elements you need to gather? Are any within 32 bytes of each other, to potentially set up for an AVX2 shuffle like vpermd?

    As Homer512 said, gather is slow on some CPU families, especially after microcode updates to mitigate data-sampling vulnerabilities, notably https://en.wikipedia.org/wiki/Downfall_(security_vulnerability) which affects the Intel Skylake and Ice Lake families (including Ice / Tiger / Rocket Lake). Only Alder Lake and newer are unaffected, along with really old stuff like Broadwell and earlier (i3/i5/i7 5xxx, and Xeon v4, pre-Scalable).

    Gather is also not great on AMD (16-cycle throughput cost on Zen 2, 8 on Zen 3 and 4), nor on older Intel (Haswell / Broadwell). It's also very bad on Gracemont E-cores: about 30 cycles of throughput cost per 256-bit gather of 8x 4-byte elements (the instruction is vpgatherdd; see https://uops.info/). And per-element efficiency is even worse for 128-bit gathers of 4x 4-byte elements (_mm_i32gather_epi32): one of those isn't twice as fast as a 256-bit gather, so two of them cost more.

    So a manual gather, like vmovd + vpinsrd, is worth considering even if you can't grab multiple useful elements with one _mm256_loadu_si256.
    The intrinsics for that are _mm_loadu_si32(const void*) and _mm_insert_epi32, which takes an int; if the source data isn't aligned and aliasing-safe to load as a 32-bit integer type, memcpy it to a temporary. You need GCC 11.3 or 12 for a working _mm_loadu_si32.
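
    A manual 4-element gather could be sketched like this (the function name and taking four separate pointers are arbitrary; the memcpy is the portable way to pull an int out of possibly-unaligned bytes and compiles to a plain load):

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    // vmovd for element 0, then vpinsrd for elements 1..3
    static inline __m128i manual_gather4(const void *p0, const void *p1,
                                         const void *p2, const void *p3)
    {
        __m128i v = _mm_loadu_si32(p0);     // vmovd: element 0, upper elements zeroed
        int32_t tmp;
        memcpy(&tmp, p1, sizeof(tmp));
        v = _mm_insert_epi32(v, tmp, 1);    // vpinsrd
        memcpy(&tmp, p2, sizeof(tmp));
        v = _mm_insert_epi32(v, tmp, 2);
        memcpy(&tmp, p3, sizeof(tmp));
        v = _mm_insert_epi32(v, tmp, 3);
        return v;
    }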

    If your indices are only readily available in a SIMD vector already, hardware gather has an advantage. But if it's something like a fixed stride, manual gathers can be used just as effectively as AVX2 hardware gathers.


    Your data is packed tightly enough that you can just SIMD load + shuffle

    The data we receive from the FPGA board (PCIe DMA) are 64 bits in size and contain: |<24-bit interesting data>|<8-bit other data>|<24-bit pad>|<8-bit other data>|, repeated up to 8,388,608 times (up to 67 MB). Typically the arrays are only 100-200k elements in size.

    That's much better than I anticipated. One 256-bit (32-byte) load can get 4 records into a __m256i, then it's just a dword shuffle (VPERMD) + right-shift to pack the integers you want into the 4 int32 elements in a __m128i to feed FP conversion.

    It would have been worth doing SIMD loads + shuffles even if only two of the records you wanted fit in each 32-byte load.

    // high 24 bits of every 64, as an unsigned bitfield converted to double
    __m256d grab_high_ints(const void *p)
    {
        __m256i vstructs = _mm256_loadu_si256((const __m256i*)p);
        // A smart compiler can load this constant with a 128-bit load, with implicit
        // zero-extension; we only care about the low half.
        __m256i shuf = _mm256_setr_epi32(1, 3, 5, 7,  0, 0, 0, 0);
        // VPERMD ymm, mem = 1 uop.  This is the AVX2 intrinsic which has the C
        // operands backwards; permutexvar is the AVX-512 version.
        __m128i vints = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(vstructs, shuf));
        vints = _mm_srli_epi32(vints, 8);  // high 24 bits with zero-extension; srai would sign-extend
        return _mm256_cvtepi32_pd(vints);
    }
    

    This should inline as 3 instructions, with the shuf control vector hoisted out of the loop into a register: vpermd ymm, ymm, [mem] / vpsrld xmm, xmm, 8 / vcvtdq2pd ymm, xmm. That's a total of 4 front-end uops (the conversion is 2); your Broadwell CPU can run it at one result per 2 cycles, bottlenecked on port 5 throughput (vpermd plus the shuffle uop of vcvtdq2pd).

    So there's room to execute other FP work (and/or a store) before this would even slow down at all, even if the data was hot in L1d cache.

    Or, in practice, bottlenecked on DRAM bandwidth, since your arrays won't be hot even in L3 cache; they're too big for that unless you do some cache-blocking. So you definitely want to do more processing while you have the data in a vector register, so the CPU can overlap that work with waiting for DRAM.

    Higher computational intensity is very important; don't make multiple passes over large arrays that don't even fit in L3 cache if you can avoid it.
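
    For example, converting and reducing in the same pass over the array, instead of storing doubles and re-reading them later (a minimal sketch assuming grab_high_ints from above and n being a multiple of 4 records; a real loop would want multiple accumulators to hide FP-add latency, plus a cleanup loop for the tail):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    // One pass: load 4 records (32 bytes) per iteration, convert, accumulate.
    double sum_high_ints(const uint64_t *records, size_t n)
    {
        __m256d acc = _mm256_setzero_pd();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm256_add_pd(acc, grab_high_ints(records + i));

        // horizontal sum of the 4 accumulator lanes
        __m128d lo = _mm256_castpd256_pd128(acc);
        __m128d hi = _mm256_extractf128_pd(acc, 1);
        lo = _mm_add_pd(lo, hi);
        lo = _mm_add_sd(lo, _mm_unpackhi_pd(lo, lo));
        return _mm_cvtsd_f64(lo);
    }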