How can we swap bytes in a Vector256 (System.Runtime.Intrinsics.X86)?

I'm optimizing a Gaussian filter in C#, using the new System.Runtime.Intrinsics.X86 namespace (single instruction, multiple data) found in .NET Core 3.0.

I'm working with Vector256 for most of the algorithm, but at the end I must do a division. I found how to go from my Vector256 to two Vector256 so that I can do the division, but I'm having trouble bringing the result back to a ushort version so I can output the data. I'm trying to use Avx2.PackUnsignedSaturate(vector1, vector2), which does give me a Vector256 of ushort, but the items come out mixed up (somewhat like an endianness problem, although the individual value of each ushort is intact).

All I need is to swap a couple of bytes in the middle. Using a regular loop (without SIMD) to put the values back in the output would be easy, but also a waste of time (well, I think so; it's hard to say as long as I can't benchmark the SIMD solution).

- I've tried a shuffle on the Vector256 cast to bytes. I'm not able to achieve what I need; it seems the byte movements are confined to their respective 128 bits.
- I've tried looking at MSDN. There are no examples or descriptions for these new functions, so for most of them I have no idea what they do.
- I've tried looking at the Intel guide (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf). While it does explain some things, I could not find the instructions I think I need (XCHG or BSWAP, maybe?) in the namespace.

using System.Runtime.Intrinsics.X86; // for Avx2

var initialVector1 = System.Runtime.Intrinsics.Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
var initialVector2 = System.Runtime.Intrinsics.Vector256.Create(9, 10, 11, 12, 13, 14, 15, 16);

var convertedBackToUshort = Avx2.PackUnsignedSaturate(initialVector1, initialVector2);

The content of convertedBackToUshort should be: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16

but I'm getting: 1, 2, 3, 4, 9, 10, 11, 12, 5, 6, 7, 8, 13, 14, 15, 16

Using Avx2.Shuffle(convertedBackToUshort, mask), I'm unable to bring the 9 back to the right place (I tried several for loops to brute-force the mask, without success).

Solution

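Avx2.PackUnsignedSaturate (aka VPACKUSWB/VPACKUSDW), like many 256-bit operations, works like two 128-bit versions of the operation side by side rather than like a scaled-up version of the 128-bit operation. That is why the result comes out as 1-4, 9-12, 5-8, 13-16: each 128-bit lane packs its own half of both inputs. There is a nice image on this page. Avx2.Shuffle on bytes has the same limitation, since it only moves bytes within their own 128-bit lane, which is why no mask could bring the 9 back into place. There are cross-lane shuffles too, for example Avx2.Permute4x64, which you can use to put the blocks in their "natural" order if you want. It takes a Vector256<UInt64>, but that doesn't matter: just reinterpret your vector before and after.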
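As a minimal sketch of that fix, using the vectors from the question: the packed result holds the 64-bit blocks in the order 0, 2, 1, 3, so a Permute4x64 with a control byte that selects blocks 0, 2, 1, 3 puts them back in order. The class and method names here (PackInOrderDemo, PackInOrder) are made up for illustration; only the Avx2 and Vector256 calls come from the framework.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class PackInOrderDemo
{
    // Hypothetical helper: packs two int vectors into one ushort vector
    // and undoes the per-lane interleaving of the 256-bit pack.
    public static Vector256<ushort> PackInOrder(Vector256<int> low, Vector256<int> high)
    {
        // Per 128-bit lane: [low.lane0, high.lane0 | low.lane1, high.lane1]
        Vector256<ushort> packed = Avx2.PackUnsignedSaturate(low, high);

        // Two control bits per destination block, lowest block first:
        // block 0 <- 0, block 1 <- 2, block 2 <- 1, block 3 <- 3.
        const byte control = 0b11_01_10_00;

        // Reinterpret as 4 x 64-bit blocks, permute across lanes, reinterpret back.
        return Avx2.Permute4x64(packed.AsUInt64(), control).AsUInt16();
    }

    public static void Main()
    {
        if (!Avx2.IsSupported)
        {
            Console.WriteLine("AVX2 is not supported on this machine.");
            return;
        }

        Vector256<int> v1 = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
        Vector256<int> v2 = Vector256.Create(9, 10, 11, 12, 13, 14, 15, 16);
        Vector256<ushort> result = PackInOrder(v1, v2);

        // Prints the 16 elements, which are now 1 through 16 in order.
        for (int i = 0; i < Vector256<ushort>.Count; i++)
        {
            Console.Write(result.GetElement(i) + " ");
        }
        Console.WriteLine();
    }
}
```

The AsUInt64/AsUInt16 calls only reinterpret the bits, so the single permute is the only extra instruction added after the pack.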

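The pack operations pair well with the unpack functions (e.g. Avx2.UnpackLow and Avx2.UnpackHigh), which also work per 128-bit lane. If you use those to widen rather than the "convert" functions, the lane-wise reordering of the unpack and of the pack cancel out and you should not need additional permutes.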
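A rough sketch of that pairing, assuming the data starts as 16 ushort values and needs to be widened to 32 bits for the intermediate step: unpacking against a zero vector zero-extends each element, and packing the two halves afterwards returns them to their original positions. WidenNarrowDemo and RoundTrip are hypothetical names, and the place where the per-element work would go is only marked with a comment.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class WidenNarrowDemo
{
    // Hypothetical helper: widens 16 ushorts to two vectors of 32-bit values
    // with unpack, then narrows them again with pack.
    public static Vector256<ushort> RoundTrip(Vector256<ushort> values)
    {
        Vector256<ushort> zero = Vector256<ushort>.Zero;

        // Interleaving with zero turns each ushort into a zero-extended 32-bit value.
        // lo holds elements 0-3 and 8-11, hi holds elements 4-7 and 12-15.
        Vector256<int> lo = Avx2.UnpackLow(values, zero).AsInt32();
        Vector256<int> hi = Avx2.UnpackHigh(values, zero).AsInt32();

        // ... the 32-bit work (for example the division) would go here ...

        // The pack is lane-wise in the same way, so the order comes out natural.
        return Avx2.PackUnsignedSaturate(lo, hi);
    }

    public static void Main()
    {
        if (!Avx2.IsSupported)
        {
            Console.WriteLine("AVX2 is not supported on this machine.");
            return;
        }

        Vector256<ushort> input = Vector256.Create(
            (ushort)1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
        Vector256<ushort> output = RoundTrip(input);

        for (int i = 0; i < Vector256<ushort>.Count; i++)
        {
            Console.Write(output.GetElement(i) + " ");
        }
        Console.WriteLine();
    }
}
```

Note that lo and hi are not in "natural" order while widened, which is fine for element-wise work such as a division but would matter for anything that depends on neighbouring elements.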

Using a scalar loop would indeed not be efficient, not only because it's a scalar loop but also because converting between vectors and a "bunch of scalars" has overhead.

There is a deeper problem in this question: a Gaussian filter (or really any convolution) does not normally include a division, and therefore doesn't end up needing this step. Since your data is ushort, you could use Avx2.MultiplyHigh to scale by a factor between 0 and 1, without having to do anything complicated.
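To illustrate the MultiplyHigh idea: for ushort inputs it returns the upper 16 bits of each 32-bit product, i.e. (a * b) >> 16, so multiplying by round(f * 65536) scales by a factor f below 1. The sketch below assumes, purely as an example, that the division is by the constant 3; ScaleDemo, Scale and OneThirdQ16 are hypothetical names. The result is an approximation of the true quotient and can be off by one for large inputs.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ScaleDemo
{
    // Assumed example factor: 1/3 in 0.16 fixed point, rounded up (65536 / 3 = 21845.33...).
    public const ushort OneThirdQ16 = 21846;

    // Hypothetical helper: scales every ushort by factorQ16 / 65536 without widening.
    public static Vector256<ushort> Scale(Vector256<ushort> values, ushort factorQ16)
    {
        // MultiplyHigh keeps the upper 16 bits of each 16x16-bit unsigned product.
        return Avx2.MultiplyHigh(values, Vector256.Create(factorQ16));
    }

    public static void Main()
    {
        if (!Avx2.IsSupported)
        {
            Console.WriteLine("AVX2 is not supported on this machine.");
            return;
        }

        Vector256<ushort> values = Vector256.Create((ushort)300); // 300 in all 16 elements
        Vector256<ushort> scaled = Scale(values, OneThirdQ16);

        // (300 * 21846) >> 16 = 100
        Console.WriteLine(scaled.GetElement(0));
    }
}
```

Because the data stays in ushort the whole time, this route needs neither the widening step nor the pack and permute afterwards.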