x86 intel cpu-architecture cpu-cache persistent-memory

How does DC PMM (memory mode) cache coherence behave?


Current setup:
Most recent Intel architectures today have non-inclusive L3 cache where each slice (+CHA) includes a "snoop filter" that contains the location information an L3 directory would have provided if it were inclusive (this design choice is likely meant to keep coherence messages from taking over mesh bandwidth). Most also enable "memory directories" by default, which can be used to filter remote snoops or otherwise change the timing properties of the local and remote portions of the coherence transaction. When a memory location belonging to a different socket is accessed, the RFO is directly sent to the QPI/UPI ring and not L3+CHA. Cores copy the Source Address Decoder (SAD) registers that L3 maintains; these registers determine which NUMA node is responsible for the physical address. Once the RFO reaches the home agent responsible, it decides if snoops must be sent to other sockets/cores and responds back to the caller (can do this in parallel). There is also OSB, which lets L3 do speculative snooping if bandwidth is available.

The "memory directory" is one or more bits located with the cache line data in DRAM that indicate whether another coherence domain might have a modified copy of the cache line.
These bits aren't updated for loads from local cores/caches because the L3/CHA will track those. After a writeback invalidation of an M-state cache line, the memory directory bit is cleared, since only one L3/CHA can have the cache line in the M state.

Intel DC PMEM:
From the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 2.1.31
(I suppose this applies to memory mode, although the section doesn't specify it):

On systems with multiple processors, a directory is used for cache coherence. This directory is implemented as a distributed in-memory directory, with the coherence state of each cache line stored in metadata within the line itself in memory.
In cases where there are cores in different processors repeatedly reading the same set of lines in the Intel Optane DC Persistent Memory Module, there will be several writes to the Intel Optane DC Persistent Memory Module recording the change in the coherence state each time.

This indicates PMM uses memory directories.

These writes are called “directory writes” and tend to be random in nature. As a result, several of these writes can lower the effective Intel Optane DC Persistent Memory Module bandwidth that is available to the application.

Would normal DRAM also suffer from random directory writes in a similar setup?
Or does it not matter for DRAM, which has a write bandwidth of 48 GB/s while PMM has only ~2.3 GB/s (1)?

Why does PMM need to use directory coherence protocol when the DRAM 'memory directory' exists?

Optane DC Persistent Memory Module may be accessed by different threads, and if these kind of patterns are observed, one option to consider is to change the coherence protocol for Intel Optane DC Persistent Memory Module regions from directory-based to snoop-based by disabling the directory system-wide.

Would RDMA requests to remote PMM need to go through remote DRAM as well?


Solution

  • Most recent Intel architectures today have non-inclusive L3 cache where each slice (+CHA)

    Processors with the server uncore design have had a non-inclusive L3 on a mesh interconnect since Skylake. Tiger Lake (TGL) is the first homogeneous (big cores only) microarchitecture with a client uncore design that includes a non-inclusive L3. See: Where data goes after Eviction from cache set in case of Intel Core i3/i7. But the CHA design isn't used in TGL.

    includes a "snoop filter" that contains the location information an L3 directory would have provided if it were inclusive

    A snoop filter is a directory. Both terms refer to the same hardware structure used to hold coherence information.

    When a memory location belonging to a different socket is accessed, the RFO is directly sent to the QPI/UPI ring

    The on-chip ring interconnect doesn't adhere to the QPI or UPI specifications. These interconnects are actually significantly different from each other. There are dedicated interfacing units between the on-chip interconnect and the external interconnects that convert between the message formats. Intel uses QPI/UPI for links between chips.

    When a memory location belonging to a different socket is accessed, the RFO is directly sent to the QPI/UPI ring and not L3+CHA.

    You mean accessed from a core? All types of requests from a core to any address go through a caching agent (CA), which could be the one collocated with that core or another CA in the same NUMA domain. When a CA receives a request, it sends it to the SAD (which is inside the CA) to determine which unit should service the request. At the same time, depending on the type of the request, it's also sent to the associated L3 slice (if present and enabled) for lookup. For example, if the request is to read a data cache line in the E/F/S state (RdData), then an L3 lookup operation is performed in parallel. If it is a read from the legacy I/O space, then no lookup is performed. If a lookup is performed and the result is a miss, the output from the SAD is used to determine where to send the request.
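
    As a rough illustration of that routing, here is a toy Python sketch. Everything in it is an assumption made for the example: the address ranges, the target names, and the helpers `sad_decode` and `route_request` are hypothetical, and real SAD rules (interleaving, multiple targets per range) are considerably more involved.

```python
from dataclasses import dataclass
from typing import Optional, Set

LINE_SHIFT = 6  # 64-byte cache lines


@dataclass(frozen=True)
class SadRule:
    """Hypothetical SAD rule: addresses in [base, limit) are serviced by `target`."""
    base: int
    limit: int
    target: str


# Made-up address map, for illustration only.
SAD_RULES = [
    SadRule(0x0000_0000, 0x4000_0000, "socket0-dram"),
    SadRule(0x4000_0000, 0x8000_0000, "socket1-dram"),
    SadRule(0x8000_0000, 0xC000_0000, "socket1-pmm"),
]

# Request types that also trigger an L3 lookup in this toy model.
L3_LOOKUP_TYPES = {"RdData"}


def sad_decode(addr: int) -> Optional[str]:
    """Return the unit that should service `addr`, or None if unmapped."""
    for rule in SAD_RULES:
        if rule.base <= addr < rule.limit:
            return rule.target
    return None


def route_request(req_type: str, addr: int, l3_lines: Set[int]) -> str:
    """Model a caching agent handling one request from a core."""
    target = sad_decode(addr)  # every request is decoded by the SAD
    # In hardware the lookup runs in parallel with the decode; here it is sequential.
    if req_type in L3_LOOKUP_TYPES and (addr >> LINE_SHIFT) in l3_lines:
        return "L3 hit"
    return f"forward to {target}"
```

    With this map, `route_request("RdData", 0x8000_0040, set())` misses in the (empty) L3 and is forwarded to the made-up `socket1-pmm` target, while a request type outside `L3_LOOKUP_TYPES` skips the lookup entirely.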

    Once the RFO reaches the home agent responsible, it decides if snoops must be sent to other sockets/cores and responds back to the caller (can do this in parallel).

    A home agent (or the home agent functionality of a CHA) doesn't send snoops locally. After a miss in the L3, assuming the home snooping mode, the following happens:

    • The request is sent to the home agent that owns the line, which will ultimately service the request.
    • A snoop request is sent to the CA that owns the line if the line is homed in a NUMA domain that is different from the one in which the requestor exists.
    • A snoop request is sent to each IIO unit in the same NUMA domain as the requestor (because there is a cache in each IIO unit).
    • A snoop request is sent to each IIO unit in the home NUMA domain.

    The HA then checks the directory cache (if supported and enabled) and if missed, it checks the directory in memory (if supported and enabled), and based on the result, it sends snoops to other NUMA domains.

    All responses are collected by the HA, which then eventually sends back the requested line and updates the directory.
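
    The steps above can be sketched as a small Python model that lists the messages generated for one L3 miss. This is only an illustration under stated assumptions: `home_snoop_messages` and the message labels are hypothetical, each directory entry is modeled as the set of NUMA nodes that may hold the line, IIO units are not snooped twice when the requestor and the home are the same node, and the directory-driven snoops skip the requestor and home nodes.

```python
from typing import Dict, List, Set, Tuple

Message = Tuple[str, str]  # (kind, destination)


def home_snoop_messages(
    requestor: int,
    home: int,
    line: int,
    iio_per_node: int,
    dir_cache: Dict[int, Set[int]],
    mem_dir: Dict[int, Set[int]],
) -> List[Message]:
    """List the messages sent after an L3 miss in this toy home-snoop model."""
    # 1. The request goes to the home agent that owns the line.
    msgs: List[Message] = [("request", f"HA@{home}")]
    # 2. Snoop the owning CA if the line is homed in another NUMA domain.
    if home != requestor:
        msgs.append(("snoop", f"CA@{home}"))
    # 3. Snoop every IIO unit in the requestor's NUMA domain.
    for i in range(iio_per_node):
        msgs.append(("snoop", f"IIO{i}@{requestor}"))
    # 4. Snoop every IIO unit in the home NUMA domain (assumption: no duplicates).
    if home != requestor:
        for i in range(iio_per_node):
            msgs.append(("snoop", f"IIO{i}@{home}"))
    # 5. The HA checks the directory cache first, then the in-memory directory.
    holders = dir_cache.get(line)
    if holders is None:  # directory cache miss
        holders = mem_dir.get(line, set())
    # 6. Snoop the other NUMA domains that may hold the line.
    for node in sorted(holders - {requestor, home}):
        msgs.append(("snoop", f"CA@{node}"))
    return msgs
```

    For example, with three nodes, a request from node 0 for a line homed on node 1 whose in-memory directory entry names node 2 produces one request to the HA on node 1, snoops to the CA and IIO on node 1 and to the IIO on node 0, and a final snoop to node 2.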

    I have no idea what you mean by "can do this in parallel."

    The "memory directory" is one or more bits located with the cache line data in DRAM that indicate whether another coherence domain might have a modified copy of the cache line.

    It's not just about tracking modified copies, but rather the presence of lines in any state.

    Note that all of the caching agents we're talking about here are in the same coherence domain. It's just one coherence domain. I think you meant another NUMA node.

    Would normal DRAM also suffer from random directory writes in a similar setup?

    Yes. The impact can be significant even for DRAM if there happen to be too many accesses to the directory and the directory cache is not supported or is disabled. But the impact is substantially larger in 3D XPoint because writes have a much lower row buffer locality (even in general, not just directory writes) and the precharge time of 3D XPoint is much higher than that of DRAM.
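
    To see why alternating readers generate so many directory writes, here is a deliberately simplified Python model. It is not the actual protocol: it assumes a single hypothetical "remote may hold a copy" bit per line, that a read from a non-home socket sets the bit, that a later read from the home socket clears it, and that every change of the bit costs one write to memory with no directory cache to absorb it.

```python
from typing import Dict, Iterable, Tuple


def count_directory_writes(trace: Iterable[Tuple[int, int]], home: int = 0) -> int:
    """Count directory updates for a trace of (socket, line) reads.

    Toy model: one bit per line, written to memory whenever it changes.
    """
    remote_bit: Dict[int, bool] = {}  # line -> "a remote socket may hold a copy"
    writes = 0
    for socket, line in trace:
        wanted = socket != home  # state the bit should have after this read
        if remote_bit.get(line, False) != wanted:
            remote_bit[line] = wanted
            writes += 1  # each state change is a write to DRAM or 3D XPoint
    return writes


LINES = range(4)
ROUNDS = 100

# Sockets 0 and 1 take turns reading the same four lines.
alternating = [(s, l) for _ in range(ROUNDS) for l in LINES for s in (0, 1)]
# Socket 0 finishes all of its reads before socket 1 starts.
grouped = [(0, l) for _ in range(ROUNDS) for l in LINES] + [
    (1, l) for _ in range(ROUNDS) for l in LINES
]

print(count_directory_writes(alternating))  # 796
print(count_directory_writes(grouped))      # 4
```

    Both traces issue the same 800 reads, but in this model the alternating one turns almost every read into a directory write, which is the pattern that hurts most on media with slow writes.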

    Why does PMM need to use directory coherence protocol when the DRAM 'memory directory' exists?

    The coherence state is stored with each line whether it's in DRAM or 3D XPoint. It takes only one transaction to read both the state and the line, instead of potentially two transactions had all of the directory been stored in DRAM. I'm not sure which design is better performance-wise and by how much, but storing the state with each line is certainly simpler.

    Would RDMA requests to remote PMM need to go through remote DRAM as well?

    I don't understand the question. Why do you think it has to go through DRAM if the address of the request is mapped to a PMM?