
Missing byte-granularity masked store in AVX


I am migrating code from SSE to AVX. The code uses _mm_maskmoveu_si128, which conditionally stores 16 bytes based on a mask. The AVX equivalent would be _mm256_maskmoveu_si256 for 32 bytes, but this instruction does not exist.

How can I emulate it efficiently?


Solution

  • MMX/SSE2 byte-masked stores (MASKMOVDQU) have NT (non-temporal) semantics and are not efficient on modern CPUs: e.g. 10 uops with 6-cycle throughput on Skylake, or 75 uops with 18-cycle latency on Zen 4. And unless the mask is all-ones, you get a partial-line NT store, which sucks on modern multi-core CPUs.

    If you can do a non-atomic load/vpblendvb/store without violating multi-threading correctness, that should work well. (This requires AVX2 for _mm256_blendv_epi8 not just AVX, but presumably whatever you're doing with single bytes in a __m256i also requires AVX2.)
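    A minimal sketch of that load/blend/store emulation (the function name is mine; requires AVX2 for `vpblendvb`, and the `target` attribute is GCC/Clang-specific):

    ```c
    #include <immintrin.h>

    // Emulate a 32-byte byte-granularity masked store as a non-atomic
    // read-modify-write: load the old destination bytes, blend in the new
    // bytes where each mask byte's high bit is set, store the result back.
    // NOT safe if another thread may concurrently write the masked-out bytes.
    __attribute__((target("avx2")))
    static inline void masked_store_256_blend(void *dst, __m256i data, __m256i mask)
    {
        __m256i old    = _mm256_loadu_si256((const __m256i *)dst);
        // blendv picks from `data` where the top bit of the mask byte is 1,
        // otherwise keeps `old` -- same mask convention as _mm_maskmoveu_si128
        __m256i merged = _mm256_blendv_epi8(old, data, mask);
        _mm256_storeu_si256((__m256i *)dst, merged);
    }
    ```

    This compiles to a plain load, `vpblendvb`, and a plain store, all cheap and fully pipelined, with no NT eviction of the cache line.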


    The only good masked stores (non-NT) are AVX vmaskmovps/pd (and AVX2 vpmaskmovd/q) with dword or qword granularity, or AVX-512BW with byte granularity (vmovdqu8 mem{k}, ymm).
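    If AVX-512BW (with AVX-512VL for the 256-bit form) is available, the byte-masked store is one intrinsic; a sketch, with a hypothetical wrapper name:

    ```c
    #include <immintrin.h>

    // vmovdqu8 [mem]{k}, ymm: true byte-granularity masked store.
    // The mask is a 32-bit bitmap in a k register, one bit per byte,
    // rather than a vector of per-byte high bits.
    // target attribute is GCC/Clang-specific.
    __attribute__((target("avx512bw,avx512vl")))
    static inline void masked_store_256_avx512(void *dst, __m256i data, __mmask32 k)
    {
        _mm256_mask_storeu_epi8(dst, k, data);
    }
    ```

    To reuse an existing vector mask, `_mm256_movepi8_mask(mask)` (`vpmovb2m`, also AVX-512BW+VL) converts the high bit of each byte into the corresponding `__mmask32` bit.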

    AVX masked stores are very slow on AMD, but AVX-512 masked stores are actually efficient on Zen 4 (apparently a single uop). IDK why they couldn't implement the microcode for vmaskmovps mem, ymm, ymm to compare into a mask and use that, making it 2 uops instead of 42 (https://uops.info/).

    Both AVX1 and AVX-512 masked stores are efficient on Intel, e.g. 3 uops on Skylake for vmaskmovps (port 0 + port 4 (store-data) + port 2/3/7 (store-address)). The port 0 uop is probably a vpmovd2m k, x/ymm to take the high bits of dword elements and make a k mask, in an internal k register usable only by microcode. Unfortunately it can't micro-fuse the store-address and store-data uops in the front-end either, as usual for instructions with uops in addition to the store.