How can I gather single bytes with AVX512 intrinsics, given a vector of int offsets?


I have a base address (uint8_t*) and a vector of 16 offsets (__m512i). I need to end up with a __m128i containing 16 bytes gathered from 16 different memory locations.

So far I have understood that there is no such primitive, and all I can use is

uint8_t base;
__m512i offsets;
__m512i values = _mm512_i32gather_epi32(base, offsets, 1);

and that gives me a __m512i where I have

Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj Vjjj

(j is junk, V is the value I'm interested in)

Now I need to repack the data to end up with a vector containing just the bytes I'm interested in, but I'm getting more and more confused, and I don't even know whether I'm following the right approach.


Solution

  • The shuffle you're looking for is part of AVX512F: _mm512_cvtepi32_epi8 (VPMOVDB). Fun fact: it even comes in a memory-destination store form if you want, although on Skylake-AVX512 hardware that's no more efficient than truncating to a register and storing that. (It does allow byte-masked stores on Xeon Phi, which lacks AVX512BW.)
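
    For instance, a minimal sketch of that store form (the wrapper name is mine; assumes a writable 16-byte destination):

    #include <immintrin.h>
    #include <stdint.h>

    // VPMOVDB with a memory destination: truncate each of the 16 dword
    // elements to its low byte and store them, under a byte mask
    // (0xFFFF = store all 16 bytes).
    void store_low_bytes(uint8_t *dst, __m512i v)
    {
        _mm512_mask_cvtepi32_storeu_epi8(dst, 0xFFFF, v);
    }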

    Yes, if you can safely read 3 bytes of junk past each byte you gather without risk of faulting by touching an unmapped page, a dword gather + pack is likely your best bet, especially if the dword loads are unlikely to split across cache-line or, worse, page boundaries. If your indices are biased towards those worst-case byte positions, consider aligning your source data differently or doing something else.

    If there's any kind of pattern in the indices, that can make it more efficient to load + shuffle manually, especially if a single vector load can span multiple values that you want. Even if there's just a fixed stride, it's worth considering looping over the indices to insert elements one at a time with vpinsrb or similar (as in AVX2 byte gather with uint16 indices) into a __m256i; a plain scalar baseline is sketched below. But with recent hardware (Skylake) and wide vectors (especially AVX512), gathers are pretty good and can approach 0.5 elements per clock.
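
    For reference, the trivial scalar baseline for this question's 16-byte case looks like this (same headers as above; the function name and index layout are mine):

    // Gather 16 bytes one at a time into a small buffer, then load the result
    // as one __m128i. No over-read past each gathered byte, unlike dword loads.
    __m128i bytegather_scalar(const uint8_t *base, const uint32_t idx[16])
    {
        uint8_t buf[16];
        for (int i = 0; i < 16; i++)
            buf[i] = base[idx[i]];
        return _mm_loadu_si128((const __m128i *)buf);
    }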


    You got the operand order wrong for _mm512_i32gather_epi32, and base of course needs to be a pointer, not a scalar uint8_t:

    #include <immintrin.h>
    #include <stdint.h>

    __m128i bytegather(uint8_t *base, __m512i offsets)
    {
        __m512i values = _mm512_i32gather_epi32(offsets, base, 1);
        return _mm512_cvtepi32_epi8(values);   // pack with truncation
    }
    

    For an AVX2 version with _mm256_i32gather_epi32, you'd have to use a different shuffle. Perhaps extract the high half, left-shift it, and word-blend (vpblendw) so all the bytes you want are in one __m128i, then vpshufb (_mm_shuffle_epi8) to pack the 8 bytes you want down to the bottom of the register, as in the sketch below.
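
    A hedged sketch of that idea, shifting by 16 bits so a word blend can merge the halves (same headers as above; the constants are my own derivation, so double-check them before relying on this):

    // Gather 8 dwords, merge the two 128-bit halves so a single vpshufb can
    // pack the 8 wanted low bytes down to the bottom of a __m128i.
    __m128i bytegather8_avx2(const uint8_t *base, __m256i offsets)
    {
        __m256i v  = _mm256_i32gather_epi32((const int *)base, offsets, 1);
        __m128i lo = _mm256_castsi256_si128(v);      // V0..V3 at byte 0 of each dword
        __m128i hi = _mm256_extracti128_si256(v, 1); // V4..V7 at byte 0 of each dword
        hi = _mm_slli_epi32(hi, 16);                 // move V4..V7 up to byte 2
        __m128i merged = _mm_blend_epi16(lo, hi, 0xAA); // each dword: [Vlo, j, Vhi, j]
        const __m128i pack = _mm_setr_epi8(0, 4, 8, 12,  2, 6, 10, 14,
                                           -1, -1, -1, -1, -1, -1, -1, -1);
        return _mm_shuffle_epi8(merged, pack);       // 8 gathered bytes at the bottom
    }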

    Subtracting one or two from the indices in the high half before the gather could avoid needing a shift, because the byte you want would then sit at a different place in the dword element. But note that means if index=0 you'd be loading from before the start of the table, so you can't do that if it could segfault. (And it might be a bad idea for performance.) See the variant sketch below.
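
    A variant of the sketch above with that adjustment (again my own constants; only safe when every high-half index is at least 2):

    // Subtract 2 from the high-half indices so the wanted byte already lands
    // at byte 2 of each gathered dword: the word blend then needs no shift.
    __m128i bytegather8_avx2_adj(const uint8_t *base, __m128i idx_lo, __m128i idx_hi)
    {
        __m128i lo  = _mm_i32gather_epi32((const int *)base, idx_lo, 1);
        __m128i adj = _mm_sub_epi32(idx_hi, _mm_set1_epi32(2));
        __m128i hi  = _mm_i32gather_epi32((const int *)base, adj, 1); // V at byte 2
        __m128i merged = _mm_blend_epi16(lo, hi, 0xAA);
        const __m128i pack = _mm_setr_epi8(0, 4, 8, 12,  2, 6, 10, 14,
                                           -1, -1, -1, -1, -1, -1, -1, -1);
        return _mm_shuffle_epi8(merged, pack);
    }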


    If you have several of these vectors and want to eventually build up a __m512i of bytes from 4 vectors of offsets, you could consider using 2-input pack instructions (like _mm512_packs_epi32 vpackssdw) with an eventual qword or dword shuffle to fix up the in-lane behaviour. But those packs only come in saturating versions, not truncating, so it costs extra instructions to clear the high garbage from each input first.
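
    A hedged sketch of that pack route (requires AVX512BW for the packs; masking first makes the saturating packs lossless, and one dword permute at the end undoes the in-lane interleaving; the permute constant is derived by hand, so verify before use):

    __m512i pack4_dwords_to_bytes(__m512i a, __m512i b, __m512i c, __m512i d)
    {
        const __m512i lowbyte = _mm512_set1_epi32(0xFF);
        a = _mm512_and_si512(a, lowbyte);  // clear the high garbage so the
        b = _mm512_and_si512(b, lowbyte);  // saturating packs can't clamp anything
        c = _mm512_and_si512(c, lowbyte);
        d = _mm512_and_si512(d, lowbyte);
        __m512i ab   = _mm512_packs_epi32(a, b);     // dword -> word, per 128-bit lane
        __m512i cd   = _mm512_packs_epi32(c, d);
        __m512i abcd = _mm512_packus_epi16(ab, cd);  // word -> byte, per lane
        // Undo the in-lane interleaving: a 4x4 transpose of the dword groups.
        const __m512i fix = _mm512_setr_epi32(0, 4, 8, 12,  1, 5, 9, 13,
                                              2, 6, 10, 14,  3, 7, 11, 15);
        return _mm512_permutexvar_epi32(fix, abcd);
    }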

    Instead it may be best to use _mm512_permutex2var_epi16 (vpermt2w) for the first step of that, although it costs multiple shuffle uops on Skylake-X, and unfortunately still does on Ice Lake, where vpermb is single-uop. You'd want to count total shuffle uops to produce one __m512i from 4 __m512i inputs and see which comes out cheapest: that, vs. truncating down to __m128i with _mm512_cvtepi32_epi8 and then building back up (sketched below).
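
    For comparison, the truncate-then-rebuild route is easy to write down (the function name is mine; the shuffle-uop accounting against a vpermt2w tree is the part you'd have to measure):

    __m512i bytegather64(const uint8_t *base, __m512i off0, __m512i off1,
                         __m512i off2, __m512i off3)
    {
        // Four gathers, four VPMOVDB truncations, then reassemble 64 bytes.
        __m128i b0 = _mm512_cvtepi32_epi8(_mm512_i32gather_epi32(off0, base, 1));
        __m128i b1 = _mm512_cvtepi32_epi8(_mm512_i32gather_epi32(off1, base, 1));
        __m128i b2 = _mm512_cvtepi32_epi8(_mm512_i32gather_epi32(off2, base, 1));
        __m128i b3 = _mm512_cvtepi32_epi8(_mm512_i32gather_epi32(off3, base, 1));
        __m512i v = _mm512_castsi128_si512(b0);
        v = _mm512_inserti32x4(v, b1, 1);
        v = _mm512_inserti32x4(v, b2, 2);
        v = _mm512_inserti32x4(v, b3, 3);
        return v;
    }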