avx512 strided gather with arbitrary stride

I know in AVX512 you can perform strided gathers with strides of 1,2,4,8. However what if I have an arbitrary stride that can be anywhere between 10-1000? The stride is known at compile time. I understand then the instruction won't be the bottleneck, the memory probably will be. Is _mm512_set_ps the most effective way to do this?

Solution

strided gathers with strides of 1,2,4,8

No, there's no special support for that; maybe you're thinking of ARM/ARM64 NEON vld4 4-way deinterleave?

In x86 you can use 1,2,4, or 8 as a scale factors for an index vector for vpgatherdd / vpgatherdps, but if you just want every 2nd element it's better to manually shuffle (e.g. _mm512_permutex2var_ps to grab alternate floats from 2 input vectors), getting many useful elements with one wide load instead of accessing cache once per element.

But in your case, with a minimum stride of 10, at most 2 elements will come from the same 16 x 32-bit 512-bit vector, and with wider strides not even one per vector.

So you can use vpgatherdps with _mm512_add_epi32(idx, _mm512_set1_epi32(16 * stride)) in a loop. Or better, just use a fixed vector of indices and increment the base pointer. You might generate that vector of indices with _mm512_mullo_epi32(_mm512_setr_epi32(0,1,2,3,...,15), _mm512_set1_epi32(stride)). Since a float is 4 bytes wide, use a scale factor of 4 with your gathers.

Even if you need to handle huge arrays, incrementing the pointer instead of the vector elements avoids any need for 64-bit indices, as well as minimizing the number of vector uops. (Valuable when using 512-bit vectors on current CPUs.)

IIRC, Intel's optimization manual has a section about strided loads and the tradeoff in manual gather vs. using gather instructions. Gather instructions become relatively better the wider your vectors are (2/clock load throughput but only 1/clock shuffle throughput for most shuffles), so especially for 512-bit vectors its likely a win to use vector shuffles.