Tags: sse, simd, avx2, avx512

SSE: does a masked store affect the bytes that were masked out?


In the Intel Intrinsics Guide there are a few intrinsics that allow storing only part of a wide register, for example _mm_maskstore, _mm_mask_store and _mm_mask_compressstoreu.

The question is: is it OK to use them if my thread doesn't own part of the cache line where they'd land, or if part of the store is past the end of the current page?

Example:

#include <atomic>
#include <cstdint>

struct S {
  std::int16_t write_here[10];
  std::atomic<std::int16_t> other_thread_can_use_this;
};

Can I write to write_here with one SIMD store? Or could that corrupt the data in other_thread_can_use_this (for example by loading it and then writing it back)?
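
For concreteness, a minimal sketch of the kind of single store in question, assuming AVX-512BW + AVX-512VL are available (the store10 helper and the mask value are only illustrative):

#include <immintrin.h>

// Store 10 int16_t elements of v into s.write_here with one masked store.
// Mask 0x03FF selects the low 10 of the 16 word elements of a __m256i; the
// other 6 words (which would overlap other_thread_can_use_this and beyond)
// are masked out.
void store10(S& s, __m256i v) {  // struct S as defined above
  _mm256_mask_storeu_epi16(s.write_here, 0x03FF, v);
}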


Solution

  • Yes: masked stores do fault suppression and maintain correctness; see When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?

    A masked store definitely does not do a non-atomic RMW of the destination.

    This all applies to SSE's (slow NT-store) maskmovdqu, AVX's relatively efficient dword/qword masked vmaskmovps/pd and vpmaskmovd/q, as well as AVX-512 masked stores. (An AVX2 sketch for the struct from the question is shown after this answer.)

    But it can be slow.

    AVX vmaskmov stores that are fully masked but target a read-only page are very slow, taking a microcode assist for every instruction. (So they perform very badly in a loop over an array doing if(a[i] == x) a[i] = y; when no changes are needed and the page was "clean", e.g. COW-mapped to a zero page; see the loop sketch after this answer.)

    I'm not sure how it performs when the full vector splits across two cache lines in the same page, and one of them would miss in cache, but all the elements of that not-present line are masked out. You'd hope that that side of the store just wouldn't end up in the store buffer at all, so there'd be no reason for the core to RFO it (gain exclusive access to it).

    Again, architecturally there's no effect on the bytes that were masked out; the only possible impact is on performance.
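
As a sketch of the AVX2 option for the struct from the question (store10_avx2 is a hypothetical helper, not part of any library): 10 x int16_t is exactly 5 dwords, so a dword-granularity vpmaskmovd with only the low 5 mask elements set writes bytes [0, 20) and architecturally never touches the bytes holding other_thread_can_use_this.

#include <immintrin.h>
#include <atomic>
#include <cstdint>

struct S {
  std::int16_t write_here[10];
  std::atomic<std::int16_t> other_thread_can_use_this;
};

// AVX2: vpmaskmovd stores at dword granularity. The low 5 of 8 mask elements
// are set, so only the 20 bytes of write_here are written; elements 5..7 are
// masked out, and those bytes are neither read nor written.
void store10_avx2(S& s, __m256i v) {
  const __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, 0, 0, 0);
  // The cast to int* is the usual pattern for this intrinsic's signature.
  _mm256_maskstore_epi32(reinterpret_cast<int*>(s.write_here), mask, v);
}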
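
And a sketch of the if(a[i] == x) a[i] = y; pattern mentioned above, done with a compare plus a dword masked store, assuming AVX2 and (for brevity) that n is a multiple of 8; replace_matches is just an illustrative name:

#include <immintrin.h>

// Replace every element equal to x with y, using vpcmpeqd to build the mask
// and vpmaskmovd to store only the matching dwords. If no element matches,
// the store is fully masked: still architecturally correct, but on a "clean"
// or read-only (e.g. COW zero) page each such store can take a microcode
// assist and run very slowly, as noted above.
void replace_matches(int* a, int n, int x, int y) {
  const __m256i vx = _mm256_set1_epi32(x);
  const __m256i vy = _mm256_set1_epi32(y);
  for (int i = 0; i < n; i += 8) {
    __m256i va   = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
    __m256i mask = _mm256_cmpeq_epi32(va, vx);  // all-ones in each matching dword
    _mm256_maskstore_epi32(a + i, mask, vy);    // writes only where a[i] == x
  }
}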