Which is better: mask_compress + store or mask_compressstoreu?

I am using SDE (Intel's emulator) to run AVX-512 code and do not have actual hardware to benchmark on.

I could not find any information on the comparative performance of compress + store versus compressstore.

compress + store would store the whole register, not just the selected elements, but I am fine with that. compressstore, on the other hand, has to mask out the unselected elements when writing.
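To make the difference concrete, here is a minimal sketch of the two options for 8 x 32-bit elements. The function names `compress_then_store` and `compress_store` are made up for illustration. The intrinsic path is only compiled when AVX512F and AVX512VL are enabled (e.g. `-mavx512f -mavx512vl`); otherwise a scalar fallback with the same observable behaviour is used, so the semantics can be checked without the hardware. It uses `_mm256_maskz_compress_epi32` (zeroing the tail) rather than the `mask_` form with a pass-through register, which is an assumption about what you want in the unused lanes.

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX512F__) && defined(__AVX512VL__)
#include <immintrin.h>
#define HAVE_AVX512VL_COMPRESS 1
#endif

/* Number of set bits in an 8-bit lane mask (portable helper). */
static int count_bits(uint8_t mask)
{
    int n = 0;
    for (int i = 0; i < 8; ++i)
        n += (mask >> i) & 1;
    return n;
}

/* Option A: compress inside the register, then do a plain full-width store.
   Always writes 8 ints to dst; lanes after the selected ones are zero.
   Returns how many elements were selected. */
static int compress_then_store(int32_t *dst, const int32_t *src, uint8_t mask)
{
#ifdef HAVE_AVX512VL_COMPRESS
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i packed = _mm256_maskz_compress_epi32((__mmask8)mask, v);
    _mm256_storeu_si256((__m256i *)dst, packed);
#else
    /* Scalar fallback modelling the same result. */
    int32_t packed[8] = {0};
    int n = 0;
    for (int i = 0; i < 8; ++i)
        if (mask & (1u << i))
            packed[n++] = src[i];
    memcpy(dst, packed, sizeof packed);
#endif
    return count_bits(mask);
}

/* Option B: compress directly to memory.
   Writes only the selected elements; memory after them is left untouched.
   Returns how many elements were selected. */
static int compress_store(int32_t *dst, const int32_t *src, uint8_t mask)
{
#ifdef HAVE_AVX512VL_COMPRESS
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_mask_compressstoreu_epi32(dst, (__mmask8)mask, v);
#else
    /* Scalar fallback modelling the same result. */
    int n = 0;
    for (int i = 0; i < 8; ++i)
        if (mask & (1u << i))
            dst[n++] = src[i];
#endif
    return count_bits(mask);
}
```

With option A the destination must always have room for 8 ints past the write position, even when only a few are selected, so an output buffer filled in a loop needs some padding at the end. Option B has no such requirement, which is what the extra store masking buys.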

Which is better? There is no latency information on Intel's website as far as I can see.

Solution

UPD: AMD Zen 4. According to this: https://www.mersenneforum.org/showthread.php?p=614191 the Zen 4 performance of compressstoreu is very poor, so if the code might run on AMD, compressstoreu should be avoided.

I looked in a slightly wrong place: the compress instructions available on my target include the epi32 ones, and those do have listed latencies:

_mm256_mask_compress_epi32 has latency 6
_mm256_mask_compressstoreu_epi32 has latency 11

The 8- and 16-bit variants require VBMI2, which is not available on my target.

So it seems that compress + store should be better.