AVX MaskLoad/MaskStore performance

Usually, when writing a SIMD-style function over a large array whose length might not divide evenly by the register width, you can process the bulk with SIMD and then handle the last few elements with scalar code.

However, the code I am currently writing is not a simple loop over an array from beginning to end. Instead, the memory reads/writes are somewhat random, so at any point in the loop I might need to access an address where a full-vector read/write would run past the end of the array.

From a little research, I have seen that I can use MaskLoad and MaskStore. Whilst this solves my problem of reading/writing out of bounds, it also kills the performance.

The MaskStore seems to increase the time taken by about 30%.

I'm wondering if there is an alternative I can use?

I have read about BlendVariable, but I don't think that helps, as you still have the issue of reading/writing past the bounds of the array.

Solution

Masked loads are cheap these days.
On modern CPUs, the performance is pretty close to normal full-vector loads.
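
As a rough illustration of a masked load near the end of an array, here is a minimal sketch. The names MaskedLoadSketch, RemainderMask and LoadPartial are made up for this example, and it assumes AVX2 is available (for the integer compare that builds the mask) and that the project allows unsafe code.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class MaskedLoadSketch
{
    // Hypothetical helper: lane i is all ones (active) when i < count, else zero.
    public static Vector256<int> RemainderMask(int count)
    {
        Vector256<int> lane = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
        return Avx2.CompareGreaterThan(Vector256.Create(count), lane);
    }

    // Loads `count` floats (0..8) starting at src[offset].
    // Lanes at or beyond `count` are not read from memory and come back as 0.
    public static unsafe Vector256<float> LoadPartial(ReadOnlySpan<float> src, int offset, int count)
    {
        fixed (float* p = src)
        {
            return Avx.MaskLoad(p + offset, RemainderMask(count).AsSingle());
        }
    }
}
```

The mask only depends on how many elements remain, so for a loop with a fixed remainder it could be computed once outside the loop.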

Masked stores are indeed rather expensive.
For example, on my CPU with AMD Zen4 cores, according to uops.info, Avx.MaskStore( float*, Vector256<float>, Vector256<float> ), i.e. _mm256_maskstore_ps, decodes into a whopping 42 µops. It’s faster on Intel but still way slower than full-vector stores.

An easy workaround is to store the remainder vector into a stack-allocated buffer with a regular full-vector store, then call remSpan.Slice( 0, rem ).CopyTo( ... ) to copy the initial slice of the stored vector from the local buffer into the destination.
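
A minimal sketch of that workaround might look like the following. PartialStoreSketch and StorePartial are hypothetical names chosen for this example; it assumes AVX is available and that unsafe code is enabled.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class PartialStoreSketch
{
    // Writes the first `rem` lanes of `v` to dest[offset .. offset + rem)
    // without using a masked store.
    public static unsafe void StorePartial(Vector256<float> v, Span<float> dest, int offset, int rem)
    {
        // Full-width store into a stack buffer, which is always safe to write.
        float* buffer = stackalloc float[Vector256<float>.Count];
        Avx.Store(buffer, v);

        // Copy only the valid leading lanes into the real destination.
        Span<float> remSpan = new Span<float>(buffer, Vector256<float>.Count);
        remSpan.Slice(0, rem).CopyTo(dest.Slice(offset, rem));
    }
}
```

Because dest.Slice( offset, rem ) is bounds-checked, a wrong offset or rem throws instead of silently writing past the end of the array.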

P.S. Note there’s a way to avoid the problem entirely. Allocate a few more elements in your arrays, making the storage size a multiple of the SIMD vector size. This way you can use full-vector loads and stores to compute your stuff. Just ignore the few padding elements in the array when consuming the computed result.
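
The padding idea could be sketched roughly like this. PaddingSketch, PaddedLength and AllocatePadded are hypothetical names for this example, and it assumes 256-bit vectors of float (8 lanes).

```csharp
using System;
using System.Runtime.Intrinsics;

static class PaddingSketch
{
    // Rounds `length` up to the next multiple of the Vector256<float> lane count (8).
    public static int PaddedLength(int length)
    {
        int lanes = Vector256<float>.Count;
        return (length + lanes - 1) / lanes * lanes;
    }

    // Allocates an array with room for whole vectors; the extra tail elements are padding.
    public static float[] AllocatePadded(int length) => new float[PaddedLength(length)];
}
```

When consuming the result, something like padded.AsSpan( 0, length ) hides the padding. Note that rounding up to a multiple only covers accesses that start on a vector boundary. If a full-vector access can start at any index, the array would instead need Vector256<float>.Count - 1 extra elements beyond the last valid one.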