The API for shuffling only supports byte and sbyte:
//
// Summary:
// __m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
//
// VPSHUFB ymm, ymm, ymm/m256
//
// Parameters:
// value:
//
// mask:
public static Vector256<sbyte> Shuffle(Vector256<sbyte> value, Vector256<sbyte> mask);
//
// Summary:
// __m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)
//
// VPSHUFB ymm, ymm, ymm/m256
//
// Parameters:
// value:
//
// mask:
public static Vector256<byte> Shuffle(Vector256<byte> value, Vector256<byte> mask);
How would you do a shuffle of other types? For example, say I have a Vector256<short> and wanted to do a shuffle with a mask of something like [0, 1, 7, 7, 3, 3, 2, 0]?
Would I have to do it at the byte level instead, i.e. convert the above mask into its byte equivalent?
For a vector of (u)short, usually yes (but it's more complicated), unless you can use AVX512 (for VPERMW) or the indexes are lined up in pairs so that you can shuffle it as a vector of (u)int.
For a vector of (u)int, there is PermuteVar8x32, which is generally more convenient anyway.
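As a minimal sketch of the (u)int case, assuming AVX2 is available: the class and method names below (PermuteSketch, PermuteInts, PermuteShortPairs) are made up for illustration, and only Avx2.PermuteVar8x32 and the AsInt32/AsInt16 reinterpretations come from the framework. The second helper shows the "indexes lined up in pairs" case, where a vector of shorts is treated as eight ints.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class PermuteSketch
{
    // result[i] = value[index[i] & 7]. VPERMD can move elements across
    // the two 128-bit halves, unlike VPSHUFB.
    public static Vector256<int> PermuteInts(Vector256<int> value, Vector256<int> index)
    {
        if (!Avx2.IsSupported)
            throw new PlatformNotSupportedException("AVX2 is required.");
        return Avx2.PermuteVar8x32(value, index);
    }

    // Moves shorts in adjacent pairs: pair i of the result is pair
    // pairIndex[i] of value (pairs are numbered 0..7).
    public static Vector256<short> PermuteShortPairs(Vector256<short> value, Vector256<int> pairIndex)
        => PermuteInts(value.AsInt32(), pairIndex).AsInt16();
}
```

With value = [10, 11, 12, 13, 14, 15, 16, 17] and index = [0, 1, 7, 7, 3, 3, 2, 0], PermuteInts returns [10, 11, 17, 17, 13, 13, 12, 10].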
By the way, Vector256.Shuffle does have an overload to shuffle a vector of shorts, but in my tests at least it just calls some fallback method, so you probably don't want to rely on that.
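For reference, the portable overload is called like this. This is only a sketch, assuming .NET 7 or later (where Vector256.Shuffle was introduced); the wrapper name ShuffleShorts is made up, and the code the JIT generates for it depends on the runtime version. It can still be handy as a baseline for checking a hand-written AVX2 version.

```csharp
using System.Runtime.Intrinsics;

static class PortableShuffleSketch
{
    // result[i] = value[index[i]] for indexes 0..15.
    // Works without AVX2, but may not map to efficient instructions.
    public static Vector256<short> ShuffleShorts(Vector256<short> value, Vector256<short> index)
        => Vector256.Shuffle(value, index);
}
```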
In general, shuffling a vector of shorts with AVX2 is a bit more of a puzzle than just shuffling it as a vector of bytes, because shuffling a vector of bytes is itself more complicated than calling Avx2.Shuffle, which is really the issue here. Avx2.Shuffle is part of the solution, but VPSHUFB does not move bytes between the two 128-bit halves of a 256-bit vector. There are various solutions depending on what your indexes look like, but the general idea is to rely mostly on shuffling bytes and to handle movement between the two 128-bit halves separately.
For example, you can make a 256-bit vector that has two copies of the lower half of the data, another 256-bit vector that has two copies of the upper half of the data, shuffle each of these, then blend based on whether you want a byte from the lower or the upper part. In general you can do any 32 byte shuffle with that, and you can build a word shuffle on top of it.
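The approach described above could be sketched roughly as follows. The names Avx2ShuffleSketch, ShuffleBytes and ShuffleWords are made up for illustration. The sketch assumes AVX2 is supported, byte indexes in the range 0..31 and word indexes in the range 0..15. The multiply-and-add step that turns word indexes into byte indexes is one possible choice, not the only one.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class Avx2ShuffleSketch
{
    // Full 32-byte shuffle: result[i] = value[index[i]], index[i] in 0..31.
    public static Vector256<byte> ShuffleBytes(Vector256<byte> value, Vector256<byte> index)
    {
        if (!Avx2.IsSupported)
            throw new PlatformNotSupportedException("AVX2 is required.");

        // Two copies of the lower half, and two copies of the upper half.
        Vector256<byte> lower = Avx2.Permute2x128(value, value, 0x00);
        Vector256<byte> upper = Avx2.Permute2x128(value, value, 0x11);

        // VPSHUFB only looks at the low 4 bits of each index (bit 7 would
        // zero the byte, but it is never set for indexes 0..31).
        Vector256<byte> fromLower = Avx2.Shuffle(lower, index);
        Vector256<byte> fromUpper = Avx2.Shuffle(upper, index);

        // Indexes 16..31 refer to the upper half: build a 0xFF mask for them.
        Vector256<byte> pickUpper = Avx2.CompareGreaterThan(
            index.AsSByte(), Vector256.Create((sbyte)15)).AsByte();

        // Take fromUpper where the mask is set, fromLower elsewhere.
        return Avx2.BlendVariable(fromLower, fromUpper, pickUpper);
    }

    // Word shuffle built on top: result[i] = value[index[i]], index[i] in 0..15.
    public static Vector256<short> ShuffleWords(Vector256<short> value, Vector256<short> index)
    {
        // Word index w becomes the byte pair (2w, 2w + 1). As a 16-bit
        // little-endian value that is w * 0x0202 + 0x0100.
        Vector256<short> byteIndex = Avx2.Add(
            Avx2.MultiplyLow(index, Vector256.Create((short)0x0202)),
            Vector256.Create((short)0x0100));

        return ShuffleBytes(value.AsByte(), byteIndex.AsByte()).AsInt16();
    }
}
```

Note that a Vector256<short> holds 16 elements, so a mask needs 16 indexes rather than 8. For example, with value = [100, 101, ..., 115] and index = [0, 1, 7, 7, 3, 3, 2, 0, 8, 9, 15, 15, 11, 11, 10, 8], ShuffleWords returns [100, 101, 107, 107, 103, 103, 102, 100, 108, 109, 115, 115, 111, 111, 110, 108]. If the indexes are known at compile time, the byte index vector and the blend mask can be precomputed as constants.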