What's the difference between "Sub-NUMA Clustering" and "Hemisphere and Quadrant Modes" in Intel CPUs?

In the technical overview published by Intel, "Sub-NUMA Clustering" and "Hemisphere and Quadrant Modes" are described separately. But the main difference between them is not clear.

In this answer, it says that "Inside quadrant or Hemisphere mode, the same LLC mapping is done as SNC, but it is exposed as one numa domain and one physical memory map."

In the Intel 64 and IA-32 Architectures Optimization Reference Manual, only "Sub-NUMA Clustering" is described (in Chapter 10); "Hemisphere and Quadrant Modes" are not mentioned.

In this document, "Hemisphere and Quadrant Modes" are categorized together as "UMA-Based Clustering" and contrasted with SNC.

Can I understand it this way: on the CPU side and in terms of LLC behavior, these two modes are exactly the same, except for the number of NUMA nodes exposed to the operating system?

Solution

You can think of Quadrant and Hemisphere modes as a kind of "automagic" SNC (Sub-NUMA Clustering) for non-NUMA-aware software, but that is not exactly what they are.

This Xeon Phi (KNL) presentation, this Intel patent, and your original 4th generation Xeon Scalable product overview helped me link the pieces together. Core counts in mainstream Xeon Scalable processors are approaching those of Xeon Phi, and it looks like Sub-NUMA Clustering worked well enough in Xeon Phi for Intel to reuse the same design in its mainstream CPUs.

For the sake of clarity let's first review what happens on an L2 miss. I'm handwaving the details here because I don't remember all the exact nomenclature and protocols. I'm just giving a high-level overview to understand the cluster modes and make sure this answer is useful to other readers.

When there's a miss in a core's L2, the cache sends a request to a designated component called the CHA (Caching and Home Agent). A CHA is associated with a mesh stop: it manages a slice of the LLC and can send requests to the memory controllers on an LLC miss.
Basically, the CHA is the component that a core queries to interact with memory above the L2, and each core has its own CHA and LLC slice.

UMA cluster mode

Also known as All-to-All (All-2-All).
This is the classic mode of operation. In the event of an L2 miss, the physical memory address is used to select one of the CHAs available in the socket. This is done with a hash function that is designed to interleave requests from every core across all CHAs. The interleaving is done at cache line granularity, since this is the unit that is moved.
For example, if we have 28 cores and thus 28 CHAs, then ideally a core requesting 28 consecutive cache lines (none of them in the L2) will spread the 28 requests across all 28 CHAs.
This is not the end, however: once a CHA receives a request, it needs to send it to a memory controller (iMC), which can be any of the memory controllers (with at least one populated slot)! (This assumes the request missed in the LLC, or that it is a write-back of a dirty line.)
So the core may need to reach a CHA across the whole socket, only to have that CHA request the memory from an iMC somewhere else in the socket, even more mesh hops away.
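
To make the interleaving concrete, here is a toy sketch in Python. The real hash function is not public as far as I know, so a plain modulo on the cache line index stands in for it. `uma_cha`, `uma_imc`, the number of iMCs, and the 4-line iMC interleave granularity are all made up for illustration.

```python
CACHE_LINE = 64   # bytes per cache line
NUM_CHAS = 28     # one CHA per core, as in the example above
NUM_IMCS = 2      # hypothetical number of populated memory controllers


def uma_cha(phys_addr):
    """Pick a CHA for a physical address (toy hash: modulo on the line index)."""
    return (phys_addr // CACHE_LINE) % NUM_CHAS


def uma_imc(phys_addr):
    """Pick an iMC for a physical address, independently of the chosen CHA.

    The 4-line interleave granularity is an arbitrary assumption.
    """
    return (phys_addr // (4 * CACHE_LINE)) % NUM_IMCS


base = 0x10000  # arbitrary, cache-line-aligned start address
lines = [base + i * CACHE_LINE for i in range(NUM_CHAS)]
chas = [uma_cha(addr) for addr in lines]
imcs = [uma_imc(addr) for addr in lines]

# 28 consecutive lines land on 28 different CHAs, and both iMCs get used,
# regardless of which core issued the requests.
print(sorted(chas) == list(range(NUM_CHAS)), sorted(set(imcs)))
```

Note that neither function takes the requesting core as an input: that is exactly why the chosen CHA and iMC can end up anywhere in the socket.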

This picture from the presentation linked above shows the various steps.

As far as the OS is concerned, there is only one NUMA node (i.e. the machine is actually UMA).

Beware: the order of the steps is purple, blue, orange, red.

Note: this picture is for the Xeon Phi architecture, not for Xeon Scalable. However, the descriptions of the cluster modes are identical, so Intel probably reused them for Xeon Scalable.

SNC mode

In this mode, the physical address space is partitioned into separate, equally sized, regions. In SNC-2 there are two regions, in SNC-4 there are four.
Each region is logically associated with a subset of the cores and of the iMCs. Again, SNC-2 creates two groups of cores and their closest iMCs, SNC-4 creates four.
So for an address range Mx we have an associated group Cx of cores and iMCs.
All the L2 misses to addresses belonging to the range Mx are interleaved only across the CHAs of the group Cx. Furthermore, the iMCs queried are also from Cx (interleaved if there is more than one, as there could be in SNC-2).
If a core in Cx requests memory only from Mx then its requests will be served by a close CHA and iMC (read: in Cx).
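
The same kind of toy sketch for SNC, again with a modulo standing in for the real hash. The 1 GiB memory size, the contiguous numbering of CHAs within a cluster, and the names `snc_cluster` and `snc_cha` are assumptions made for illustration only.

```python
CACHE_LINE = 64
NUM_CHAS = 28
NUM_CLUSTERS = 2            # SNC-2; use 4 for SNC-4
MEM_SIZE = 1 << 30          # hypothetical 1 GiB of physical memory

CHAS_PER_CLUSTER = NUM_CHAS // NUM_CLUSTERS
REGION_SIZE = MEM_SIZE // NUM_CLUSTERS   # size of each address range Mx


def snc_cluster(phys_addr):
    """Return x such that phys_addr belongs to the address range Mx."""
    return phys_addr // REGION_SIZE


def snc_cha(phys_addr):
    """Interleave lines only across the CHAs of the cluster owning the address."""
    cluster = snc_cluster(phys_addr)
    local = (phys_addr // CACHE_LINE) % CHAS_PER_CLUSTER
    return cluster * CHAS_PER_CLUSTER + local


# Lines from M0 only ever reach CHAs 0..13, lines from M1 only CHAs 14..27.
m0 = {snc_cha(i * CACHE_LINE) for i in range(100)}
m1 = {snc_cha(REGION_SIZE + i * CACHE_LINE) for i in range(100)}
print(sorted(m0), sorted(m1))
```

Here the address alone decides the cluster, so a core in C0 that touches only M0 never leaves its own group of CHAs.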

In UMA mode, to make sure that a program consistently hits a CHA or an iMC close to the requesting core, we would need to make strided accesses, which is impractical for many data structures.
In SNC-n mode all the memory addresses of a cluster are grouped together, so, with the help of firmware that exposes the necessary NUMA metadata (via ACPI), a NUMA-aware OS can allocate the memory for a program in the same node.
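
If you want to check what the firmware actually exposed, a minimal sketch for Linux is to read the list of online NUMA nodes from sysfs. This assumes a kernel built with NUMA support, where `/sys/devices/system/node/online` contains a list such as `0-3`; the helper names `parse_node_list` and `online_numa_nodes` are my own.

```python
def parse_node_list(text):
    """Parse a Linux list string such as "0-3" or "0,2-3" into a list of ints."""
    nodes = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            nodes.extend(range(int(first), int(last) + 1))
        else:
            nodes.append(int(part))
    return nodes


def online_numa_nodes(path="/sys/devices/system/node/online"):
    """Return the NUMA node ids the OS currently sees (Linux only)."""
    with open(path) as f:
        return parse_node_list(f.read())
```

Calling `online_numa_nodes()` on a single-socket machine should report more than one node when SNC is enabled, and a single node when the socket is presented as UMA.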

Here's the picture for SNC mode.

Quadrant and Hemisphere mode

These are a sort of hybrid modes; they were designed to make non-NUMA-aware software perform better.
When there is an L2 miss, the CHA is selected as in UMA mode, i.e. with a hash function that is designed to interleave consecutive lines over all the available CHAs.
However, when a CHA receives a request (and it must be served from memory), it directs it to its closest iMC. On average, this reduces latency compared to UMA mode.
So, compared to SNC mode, these modes first act like UMA (when choosing the CHA) and then like SNC (as they don't route the requests far away).
Only a single node is exposed to the OS.

Quadrant and Hemisphere modes do not perform as well as NUMA-aware software running in SNC mode (because the chosen CHA can be anywhere), but I think they are better than a cluster mode that behaves like SNC but exposes a single node (or no nodes) to the OS. In fact, interleaving the CHAs will sometimes pick a close CHA and sometimes not, while a memory block in the wrong address range in SNC would always pick "bad" CHAs (and the OS won't help in this hypothetical mode, since the firmware told it the machine is UMA).
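
As a rough sanity check of that ordering, here is a back-of-the-envelope model, not a measurement. It assumes a hypothetical 4x4 mesh with one core and one CHA per tile, one iMC at the outer corner of each quadrant, and Manhattan distance as the hop count; only the core-to-CHA and CHA-to-iMC request hops are counted. The SNC case assumes NUMA-aware software that only touches memory of its own cluster.

```python
from itertools import product

SIDE = 4                      # hypothetical 4x4 mesh of tiles
HALF = SIDE // 2
TILES = list(product(range(SIDE), repeat=2))


def quadrant(tile):
    """Return the (row, column) index of the quadrant a tile belongs to."""
    row, col = tile
    return (row // HALF, col // HALF)


def imc_of(quad):
    """Assumed placement: one iMC at the outer corner of each quadrant."""
    return (quad[0] * (SIDE - 1), quad[1] * (SIDE - 1))


IMCS = [imc_of(q) for q in product(range(2), repeat=2)]


def hops(a, b):
    """Manhattan distance between two tiles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# UMA: any CHA, then any iMC.
uma = mean(hops(core, cha) + hops(cha, imc)
           for core in TILES for cha in TILES for imc in IMCS)

# Quadrant: any CHA, then the iMC of that CHA's quadrant.
quad = mean(hops(core, cha) + hops(cha, imc_of(quadrant(cha)))
            for core in TILES for cha in TILES)

# SNC-4 with local memory: CHA and iMC both in the core's own quadrant.
snc = mean(hops(core, cha) + hops(cha, imc_of(quadrant(cha)))
           for core in TILES for cha in TILES
           if quadrant(cha) == quadrant(core))

print(uma, quad, snc)
```

In this toy model the average request path is shortest for SNC with local memory, longest for UMA, and Quadrant sits in between, which matches the reasoning above. The absolute numbers mean nothing for a real chip.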

Here's the last picture.