I am using SDE (Intel's emulator) to run AVX-512 code and do not have actual hardware to benchmark on.
I could not find any information on the comparative performance of compress + store versus compressstore.
compress + store writes the whole register and not just the selected elements, but I am fine with that, whereas compressstore has to mask out the non-selected elements.
Which is better? There is no latency information on Intel's website as far as I can see.
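To make the comparison concrete, here is a minimal sketch of the two variants for 8 x int32 in a 256-bit vector. The function names `compress_then_store`, `compressstore` and `count_selected` are made up for illustration, and the intrinsic path assumes AVX512F + AVX512VL; the scalar branch is only a reference so the snippet still compiles and behaves the same without those flags.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__) && defined(__AVX512VL__)
#include <immintrin.h>
#define HAVE_AVX512_COMPRESS 1
#else
#define HAVE_AVX512_COMPRESS 0
#endif

/* Number of set bits in the mask = number of selected elements. */
size_t count_selected(uint8_t mask) {
    size_t n = 0;
    for (; mask; mask &= (uint8_t)(mask - 1)) ++n;
    return n;
}

/* Variant A: compress in a register, then do a plain full-width store.
   Always writes all 8 lanes (unselected tail is zero), so dst needs room
   for 8 ints even if fewer are selected. */
size_t compress_then_store(int32_t *dst, const int32_t *src, uint8_t mask) {
#if HAVE_AVX512_COMPRESS
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i packed = _mm256_maskz_compress_epi32((__mmask8)mask, v);
    _mm256_storeu_si256((__m256i *)dst, packed);
#else
    /* Scalar reference with the same observable behaviour. */
    size_t k = 0;
    for (int i = 0; i < 8; ++i)
        if ((mask >> i) & 1) dst[k++] = src[i];
    while (k < 8) dst[k++] = 0;
#endif
    return count_selected(mask);
}

/* Variant B: compressing store. Writes only the selected elements,
   contiguously, and leaves the rest of dst untouched. */
size_t compressstore(int32_t *dst, const int32_t *src, uint8_t mask) {
#if HAVE_AVX512_COMPRESS
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_mask_compressstoreu_epi32(dst, (__mmask8)mask, v);
#else
    /* Scalar reference with the same observable behaviour. */
    size_t k = 0;
    for (int i = 0; i < 8; ++i)
        if ((mask >> i) & 1) dst[k++] = src[i];
#endif
    return count_selected(mask);
}
```

In both cases the caller would advance the output pointer by the returned count; with variant A the next store simply overwrites the zeroed tail, which is why the output buffer needs a full vector of slack at the end.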
UPD: AMD Zen 4. According to this: https://www.mersenneforum.org/showthread.php?p=614191 the performance of compressstoreu on Zen 4 is very poor, so if the code might run on AMD, compressstoreu should be avoided.
I was looking in slightly the wrong place: the compress instructions available on my target are the epi32 ones, and those do have latencies listed:
_mm256_mask_compress_epi32 has latency 6
_mm256_mask_compressstoreu_epi32 has latency 11
and the epi8/epi16 variants require AVX512_VBMI2, which is not available on my target.
So it seems like compress + store should be better.