x86, sse, avx

SSE/AVX: using float shuffles + casts as substitute for missing integer shuffle intrinsics?


Is it always OK to simply use float shuffles + casts as a substitute for missing integer shuffle intrinsics in SSE/AVX, like this:

__m128i x = _mm_castps_si128( _mm_shuffle_ps ( _mm_castsi128_ps(y), ...  

In theory this should, of course, work with instructions that do not interpret the bit patterns of the vector elements and are thus agnostic as to whether they contain floats or integers. However, I remember a post by (IIRC) @Peter Cordes saying that using float shuffles on integer registers works on "some" CPUs only.
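For concreteness, here is a minimal sketch of the pattern in question. The helper name `even_elements_epi32` and the `_MM_SHUFFLE(2, 0, 2, 0)` immediate are illustrative assumptions; they are not the elided arguments of the snippet above. The self-check feeds in bit patterns that would be NaN, denormal, or -0.0 if read as floats, to show that the shuffle just moves bits.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2; also pulls in xmmintrin.h for _mm_shuffle_ps */
#include <stdint.h>

/* Hypothetical helper: returns { a[0], a[2], b[0], b[2] } as 32-bit integers.
   shufps selects its low two result elements from the first source and its
   high two from the second, a two-source selection that the single-source
   pshufd cannot express. */
static __m128i even_elements_epi32(__m128i a, __m128i b)
{
    __m128 fa = _mm_castsi128_ps(a); /* reinterpret only, no value conversion */
    __m128 fb = _mm_castsi128_ps(b);
    __m128 r  = _mm_shuffle_ps(fa, fb, _MM_SHUFFLE(2, 0, 2, 0));
    return _mm_castps_si128(r);
}

/* Self-check with inputs that are NaN, denormal, or -0.0 when viewed as floats. */
static void check_even_elements(void)
{
    const uint32_t a[4] = { 0x7F800001u, 2u, 0x00000001u, 4u };
    const uint32_t b[4] = { 0xFFFFFFFFu, 6u, 0x80000000u, 8u };
    uint32_t out[4];

    __m128i r = even_elements_epi32(_mm_loadu_si128((const __m128i *)a),
                                    _mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)out, r);

    /* Every selected element comes back bit-for-bit unchanged. */
    assert(out[0] == 0x7F800001u && out[1] == 0x00000001u);
    assert(out[2] == 0xFFFFFFFFu && out[3] == 0x80000000u);
}
```

Calling `check_even_elements()` from a test harness exercises the helper; the question is whether this pattern is safe and fast on all CPUs.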


Solution

  • It always works for correctness; the only risk is a cycle or two of extra latency on some CPUs, notably Nehalem, where bypass-latency penalties are 2 cycles each way. It's completely fine on Sandybridge-family: there is no extra latency to forward to or from the shuffle unit for either the FP or the SIMD-integer domain. Agner Fog mentions this in his microarch guide (https://agner.org/optimize/).

    "SSE: shuffle (permutevar) 4x32 integers" also discusses this.

    See "Fastest way to do horizontal SSE vector sum (or other reduction)" for shuffle performance details on old CPUs, and some mention of bypass delays (e.g. shufps on Core 2 runs in the integer domain, so it actually has extra bypass latency when used between FP instructions).

    "implications of using _mm_shuffle_ps on integer vector" also has some links to bypass-delay details.

    You can sometimes take advantage of 64-bit shifts (psllq or psrlq) to move 32-bit floats around within 64-bit chunks to set up for a blend, if you're bottlenecked on shuffle-unit throughput. But that typically does cost a cycle of bypass latency if the input comes directly from an FP math instruction.
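As a rough sketch of the shift trick from the last paragraph: a 64-bit right shift by 32 bits drops the high float of each 64-bit chunk into the low position, after which a merge picks it up. The helper name `low_from_a1` is hypothetical, and `_mm_move_ss` stands in for the blend only to keep the sketch within SSE2; with SSE4.1 available, `_mm_blend_ps` would be the more general choice. No claim is made here about which execution port each instruction uses on any particular CPU.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_srli_epi64 and the cast intrinsics */

/* Hypothetical helper: returns { a[1], b[1], b[2], b[3] }. */
static __m128 low_from_a1(__m128 a, __m128 b)
{
    /* psrlq by 32: each 64-bit chunk { lo, hi } becomes { hi, 0 },
       so the whole vector becomes { a[1], 0, a[3], 0 }. */
    __m128i shifted = _mm_srli_epi64(_mm_castps_si128(a), 32);

    /* Merge: low element from the shifted vector, upper three from b. */
    return _mm_move_ss(b, _mm_castsi128_ps(shifted));
}

/* Self-check on small values that are exactly representable as floats. */
static void check_low_from_a1(void)
{
    const float a[4] = { 10.0f, 11.0f, 12.0f, 13.0f };
    const float b[4] = { 20.0f, 21.0f, 22.0f, 23.0f };
    float out[4];

    _mm_storeu_ps(out, low_from_a1(_mm_loadu_ps(a), _mm_loadu_ps(b)));

    assert(out[0] == 11.0f && out[1] == 21.0f);
    assert(out[2] == 22.0f && out[3] == 23.0f);
}
```

Per the answer above, expect the integer shift to add a cycle of bypass latency when its input comes straight from an FP math instruction, so this is only worth considering when shuffle throughput is the bottleneck.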