Search code examples
assemblyx86intelcpu-architectureinstruction-set

Why doesn't Ice Lake have MOVDIRx like tremont? Do they already have better ones?


I notice that Intel Tremont has 64 bytes store instructions with MOVDIRI and MOVDIR64B.
Those guarantees atomic write to memory, whereas don't guarantee the load atomicity. Moreover, the write is weakly ordered, immediately followed fencing may be needed.
I find no MOVDIRx in IceLake.

Why doesn't Ice Lake need such instructions like MOVDIRx?

(At the bottom of page 15)
Intel® ArchitectureInstruction Set Extensions and Future FeaturesProgramming Reference
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf#page=15


Solution

  • Ice Lake has AVX512, which gives us 64-byte loads + stores, but no guarantee of 64-byte store atomicity.

    We do get 64-byte NT stores with movntps [mem], zmm / movntdq [mem], zmm. Interestingly, NT stores don't support merge-masking to leave some bytes unwritten. That would basically defeat the purpose of NT stores by creating partial-line writes, though.

    Probably Ice Lake Pentium / Celeron CPUs still won't have AVX1/2, let alone AVX512 (probably so they can sell chips with defects in the upper 128 bits of the FMA units and/or register file on at least one core), so only rep movsb will be able to internally use 64-byte loads/stores on those CPUs. (IceLake will have the "fast short rep" feature, which may make it useful even for small 64-byte copies, useful in kernel code that can't use vector regs.)

    (Update: Ice Lake and later Pentium / Celeron finally have AVX1/2 and BMI1/2, but not AVX-512. And Alder Lake and later don't have AVX-512 even in their high-end models, probably for market segmentation since Intel won't support it even for CPUs with no E cores.)


    Possibly Intel can't (or doesn't want to) provide that atomicity guarantee on their mainstream CPUs, only on low-power chips that don't support multiple sockets, but I haven't heard any reports of tearing actually existing within a cache line on Intel CPUs. In practice, I think cached loads/stores that don't cross a cache-line boundary on current Intel CPUs are always atomic.

    (Unlike on AMD K10 where HyperTransport did create tearing on 8B boundaries between sockets, while no tearing could be seen between cores on a single socket. SSE instructions: which CPUs can do atomic 16B memory operations?)

    In any case, there's no way to detect this with CPUID, and it's not documented, so it's basically impossible to safely take advantage. It would be nice if there was a CPUID leaf that told you the atomicity width for the system and for within a single socket, so implementations that split 512-bit AVX512 ops into 256-bit halves would still be allowed....

    Anyway, rather than introducing a special instruction with guaranteed store atomicity, I think it would be more likely for CPU vendors to start documenting and providing CPUID detection of wider store atomicity for either all power-of-2-size stores, or for only NT stores, or something.

    (Update: Intel recently documented that the AVX feature bit implies 128-bit atomicity for aligned loads/stores like movaps. This retroactively documents / guarantees something that's been true for a long time: https://rigtorp.se/isatomic/)

    Making some part of AVX512 require 64-byte atomicity would make it much harder for AMD to support, if they follow their current strategy of half-width vector implementation. (Zen2 has 256-bit vector ALUs, making AVX1/AVX2 instructions mostly single-uop, but no AVX512 until Zen 4, unfortunately. AVX512 is a very nice ISA even if you only use it at 256-bit width, filling more gaps in what can be done conveniently / efficiently, e.g. unsigned int<->FP and [u]int64<->double.)

    So IDK if maybe Intel agreed not to do that, or chose not to for their own reasons.


    Use case for 64B write atomicity:

    I suspect the main use-case is reliably creating 64-byte PCIe transactions, not actually "atomicity" per-se, and not for observation by another core.

    If you cared about reading from other cores, normally you'd want L3 cache to backstop the data, not bypass it to DRAM. A seqlock is probably a faster way to emulate 64-byte atomicity between CPU cores, even if movdir64B is available.

    Skylake already has 12 write-combining buffers (up from 10 in Haswell), so it's (maybe?) not too hard to use regular NT stores to create a full-size PCIe transaction, avoiding early flushes. But maybe low-power CPUs have fewer buffers and maybe it's a challenge to reliably create 64B transactions to a NIC buffer or something.