AVX2 MaskLoad/MaskStore of ushorts?

I have been trying to figure out how to do a masked load/store with ushorts, more as an experiment than an actual requirement.

There is no MaskLoad/MaskStore overload that accepts a pointer to a ushort, so I thought you might have to do it with an int pointer, which is what I have attempted below.

In TestUShort, I have set up an offset from the beginning of the array to skip the first two ushorts:

ushort* address = ptr + 2;

Then I have created a mask,

[0xFFFF0000, 0xFFFFFFFF, 0, 0xFFFFFFFF, 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]

which I thought would behave as the following (X is include, 0 is mask out)

[X,0, X,X, 0,0, X,X, 0,0, X,X, X,X, X,X]

So I would expect

0, 0 <-- (first two skipped) ...12, 0, 12, 12, 0, 0, 12, 12, 0, 0, 12, 12, 12, 12, 12, 12

but instead I get :

0, 0 <-- (first two skipped) ...12, 12, 12, 12, 0, 0, 12, 12, 0, 0, 12, 12, 12, 12, 12, 12

Is this because you can only mask on 4-byte blocks? Is there a way to do this? I am also interested in the performance impact of such an operation on ushorts, with an offset and a mask, which I assume will be very costly due to non-alignment? I have not benchmarked it yet, since timings have no real meaning while the code is not working.

  static unsafe void TestUShort()
  {
      ushort[] values = new ushort[256];
      Vector256<uint> mask = Vector256.Create<uint>([0xFFFF0000, 0xFFFFFFFF, 0, 0xFFFFFFFF, 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]);
      Vector256<ushort> add = Vector256.Create<ushort>(12);

      fixed (ushort* ptr = values)
      {
          ushort* address = ptr + 2;
          uint* addressInt = (uint*)address;

          var v = Avx2.MaskLoad(addressInt, mask).As<uint, ushort>();
          v = Avx2.AddSaturate(v, add);

          Avx2.MaskStore(addressInt, mask, v.As<ushort, uint>());
      }
  }

  static unsafe void TestUInt()
  {
      uint[] uintValues = new uint[256];

      Vector256<uint> mask = Vector256.Create<uint>([0xFFFFFFFF, 0xFFFFFFFF, 0, 0xFFFFFFFF, 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF]);
      Vector256<uint> add = Vector256.Create<uint>(12);

      fixed (uint* ptr = uintValues)
      {
          var v = Avx2.MaskLoad(ptr, mask);
          v = Avx2.Add(v, add);

          Avx2.MaskStore(ptr, mask, v);
      }
  }

Solution

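There is no AVX2 instruction that directly performs a masked store of words. You cannot use the dword masked store (VPMASKMOVD, which is what Avx2.MaskStore maps to for uint) for this because, as the Intel manual puts it:

The mask bit for each data element is the most significant bit of that element in the first source operand. If a mask is 1, the corresponding data element is copied from the second source operand to the destination operand. If the mask is 0, the corresponding data element is set to zero in the load form of these instructions, and unmodified in the store form.

In other words, only bit 31 of each 32-bit mask element is consulted, and each dword is included or excluded as a whole. Your first mask element, 0xFFFF0000, has its top bit set, so both ushorts in that dword are processed, which is the 12, 12 you see at the start of the output.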
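To see the "most significant bit only" rule in isolation, here is a minimal sketch. The class and method names (DwordMaskDemo, LoadWithDwordMask) are made up for illustration; it only assumes an AVX2-capable machine and a project that allows unsafe code.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical demo type, not part of any library.
static class DwordMaskDemo
{
    // Loads 16 ushorts through the dword MaskLoad and returns what came back.
    // Each uint mask element controls a whole pair of ushorts, and only
    // bit 31 of that element is looked at.
    public static unsafe ushort[] LoadWithDwordMask(ushort[] source, Vector256<uint> mask)
    {
        if (!Avx2.IsSupported)
            throw new PlatformNotSupportedException("AVX2 is required.");
        if (source.Length < 16)
            throw new ArgumentException("Need at least 16 ushorts.", nameof(source));

        var result = new ushort[16];
        fixed (ushort* src = source)
        fixed (ushort* dst = result)
        {
            Vector256<ushort> loaded = Avx2.MaskLoad((uint*)src, mask).AsUInt16();
            Avx.Store(dst, loaded);
        }
        return result;
    }
}
```

With a source array filled with 7 and the mask Vector256.Create(0xFFFF0000u, 0x0000FFFFu, 0x80000000u, 0x7FFFFFFFu, 0u, 0u, 0u, 0u), the first eight results are 7, 7, 0, 0, 7, 7, 0, 0: the low 31 bits of each mask element make no difference.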

MASKMOVDQU (Sse2.MaskMove) can do a masked store on a per-byte basis. However, it uses a non-temporal hint, which may (allegedly) be good in exactly the kind of use case it was designed for (at least the manual makes it sound like it's good for something; I've never actually used it successfully), but is mostly bad in normal cases. For example, on Zen 3 it takes 75 µops and can be executed only once every 18 cycles. It is also only a 128-bit operation, so you would need two of them.
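For completeness, this is roughly what the two-MASKMOVDQU version would look like. Treat it as an untuned sketch: ByteMaskStoreDemo and StoreWhereMasked are hypothetical names, and the mask is assumed to hold 0xFFFF in every word to be written and 0 elsewhere, so that both bytes of a word carry the same mask bit.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical demo type, not part of any library.
static class ByteMaskStoreDemo
{
    // Writes the words of `data` selected by `mask` to destination[offset..offset+15],
    // using one 128-bit MASKMOVDQU per half of the 256-bit vector.
    public static unsafe void StoreWhereMasked(
        ushort[] destination, int offset, Vector256<ushort> data, Vector256<ushort> mask)
    {
        if (!Sse2.IsSupported)
            throw new PlatformNotSupportedException("SSE2 is required.");
        if (offset < 0 || offset > destination.Length - 16)
            throw new ArgumentOutOfRangeException(nameof(offset));

        fixed (ushort* ptr = destination)
        {
            byte* p = (byte*)(ptr + offset);

            // The mask is per byte: a byte is stored when the top bit of its mask byte is set.
            Sse2.MaskMove(data.GetLower().AsByte(), mask.GetLower().AsByte(), p);
            Sse2.MaskMove(data.GetUpper().AsByte(), mask.GetUpper().AsByte(), p + 16);

            // Non-temporal stores are weakly ordered; the fence matters if
            // another thread is going to read the data afterwards.
            Sse.StoreFence();
        }
    }
}
```

Given the throughput figures above, this is mainly useful as a correctness reference to compare against the faster alternatives.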

In most cases you can implement a masked store by loading the destination, blending your new data with the old data according to the mask, and storing the whole thing back. There are two concerns that may prevent you from using that technique:

Unlike a proper masked store, touching an invalid page with a "blended store" will trigger a page fault. That makes it a poor fit for handling the partial vector that you may have at the end of an array. Depending on the circumstances you may be able to do a partially-overlapping last iteration that goes exactly up to the end of the array, but also partially re-writes some data that was written by the last "normal" iteration. Or you may just have to handle the last elements with scalar code.
In a multi-threaded scenario, "blended stores" are not safe for concurrent modification. Data may be lost. I don't have any suggestion for this except to avoid being in that scenario.
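A minimal sketch of the load/blend/store approach for the code in the question, subject to the two concerns just listed. BlendedStoreDemo and AddSaturateWhereMasked are hypothetical names, and the mask is assumed to contain 0xFFFF for each word to update and 0 for each word to leave alone.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical demo type, not part of any library.
static class BlendedStoreDemo
{
    // Adds `amount` (saturating) to the selected words in values[offset..offset+15].
    // All 16 words are read and written back, so the whole 32-byte range
    // must be valid memory and must not be modified concurrently.
    public static unsafe void AddSaturateWhereMasked(
        ushort[] values, int offset, Vector256<ushort> mask, ushort amount)
    {
        if (!Avx2.IsSupported)
            throw new PlatformNotSupportedException("AVX2 is required.");
        if (offset < 0 || offset > values.Length - 16)
            throw new ArgumentOutOfRangeException(nameof(offset));

        fixed (ushort* ptr = values)
        {
            ushort* address = ptr + offset;

            Vector256<ushort> old = Avx.LoadVector256(address);      // plain unaligned load
            Vector256<ushort> updated = Avx2.AddSaturate(old, Vector256.Create(amount));

            // VPBLENDVB selects per byte: `updated` where the mask byte's top bit
            // is set, `old` elsewhere.
            Vector256<ushort> blended = Avx2.BlendVariable(old, updated, mask);

            Avx.Store(address, blended);                             // plain unaligned store
        }
    }
}
```

Called on a zeroed array with offset 2, amount 12 and the mask [X,0, X,X, 0,0, X,X, 0,0, X,X, X,X, X,X] (X = 0xFFFF), this leaves the first two elements at 0 and then produces 12, 0, 12, 12, 0, 0, 12, 12, 0, 0, 12, 12, 12, 12, 12, 12, which is the output the question was hoping for.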

In the special case of adding 12 to some subset of the words, you can add 12 to the elements that you want to change and 0 to the elements that you don't want to change, subject to the same two concerns. It's fairly common to be able to do something like that instead of needing the full generality of blending with the old data, but of course sometimes you just need to blend.
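The "add 12 or add 0" variant can be sketched like this. As before, MaskedAddendDemo and AddSaturateWhereMasked are hypothetical names, and the mask is assumed to hold 0xFFFF in each word to change and 0 in the rest.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical demo type, not part of any library.
static class MaskedAddendDemo
{
    // Adds `amount` (saturating) to the selected words and 0 to the others.
    // Still a full 32-byte read and write, so the page-fault and
    // concurrency caveats apply here too.
    public static unsafe void AddSaturateWhereMasked(
        ushort[] values, int offset, Vector256<ushort> mask, ushort amount)
    {
        if (!Avx2.IsSupported)
            throw new PlatformNotSupportedException("AVX2 is required.");
        if (offset < 0 || offset > values.Length - 16)
            throw new ArgumentOutOfRangeException(nameof(offset));

        fixed (ushort* ptr = values)
        {
            ushort* address = ptr + offset;

            // amount & 0xFFFF == amount, amount & 0 == 0
            Vector256<ushort> addend = Avx2.And(Vector256.Create(amount), mask);

            Avx.Store(address, Avx2.AddSaturate(Avx.LoadVector256(address), addend));
        }
    }
}
```

In a loop where the mask does not change, the masked addend can be computed once outside the loop, leaving just a load, an add and a store per iteration.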

I am also interested in the performance impact of such an operation on ushorts, with an offset and a mask, which I assume will be very costly due to non-alignment?

All AVX2-capable processors handle (normal) unaligned loads and stores quite well. It's normally not the big deal it used to be on Core 2 era processors. Split locks are still bad, but you're doing SIMD, not lock-prefixed RMW operations.