
Does NUMA impact memory bandwidth, or just latency?


I have a problem that is memory-bandwidth limited -- I need to read a lot of data (many GB) sequentially from RAM, do some quick processing, and write it sequentially to a different location in RAM. Memory latency is not a concern.

Is there any benefit from dividing the work between two or more cores in different NUMA zones? Equivalently, does working across zones reduce the available bandwidth?
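For concreteness, a minimal sketch of the kind of loop I mean (the array names and the process() step are placeholders for my actual code):

```c
#include <stddef.h>

/* Sketch of the workload: stream src from RAM, apply a cheap
 * transform, stream the result to a separate dst buffer.
 * process() stands in for the "quick processing" step. */
static inline double process(double x) { return 2.0 * x + 1.0; }

void stream_transform(const double *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = process(src[i]);  /* one sequential read, one sequential write */
}
```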


Solution

  • For bandwidth-limited, multi-threaded code, the behavior in a NUMA system will primarily depend on how "local" each thread's data accesses are, and secondarily on the details of the remote accesses.

    In a typical 2-socket server system, the local memory bandwidth available to two NUMA nodes is twice that available to a single node. (But remember that it may take many threads running on many cores to reach asymptotic bandwidth for each socket.)
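    As a rough sketch (assuming a Linux system with libnuma installed; compile with -lnuma), you can query how many NUMA nodes and CPUs per node the machine exposes, which tells you how many threads you have available to spread over each socket:

```c
#include <stdio.h>
#include <numa.h>          /* libnuma; link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    int cpus  = numa_num_configured_cpus();
    printf("%d NUMA node(s), %d logical CPU(s)\n", nodes, cpus);

    /* Count how many logical CPUs belong to each node. */
    for (int node = 0; node < nodes; node++) {
        int count = 0;
        for (int cpu = 0; cpu < cpus; cpu++)
            if (numa_node_of_cpu(cpu) == node)
                count++;
        printf("node %d: %d CPU(s)\n", node, count);
    }
    return 0;
}
```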

    The STREAM Benchmark, for example, is typically run in a configuration that allows almost all accesses from every thread to be "local". This relies on "first touch" NUMA placement -- when allocated memory is first written, the OS must create mappings from the process's virtual address space to physical addresses, and (by default) it chooses physical addresses on the same NUMA node as the core that executed the store instruction.
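    A minimal sketch of the first-touch idiom in C with OpenMP (the array names, size, and kernel are illustrative, not taken from STREAM itself): initialize the data with the same thread-to-iteration mapping that the bandwidth-limited kernel will use, so each page is first written -- and therefore placed -- on the node that will later stream it. Compile with -fopenmp.

```c
#include <stdlib.h>

#define N (1UL << 27)   /* illustrative size: 2^27 doubles (1 GiB) per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* First touch: each thread writes the pages it will later read/write,
     * so (with the default first-touch policy) those pages land on that
     * thread's NUMA node.  The same static schedule is used below. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 0.0;
    }

    /* Bandwidth-limited kernel: accesses are now (mostly) node-local. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    free(a);
    free(b);
    return 0;
}
```

    For this to work, the threads must stay put -- run with thread binding enabled (e.g., OMP_PROC_BIND=spread and OMP_PLACES=cores) so they do not migrate away from the node holding their pages.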

    "Local" bandwidth (to DRAM) in most systems is approximately symmetric (for reads and writes) and relatively easy to understand. "Remote" bandwidth is much more asymmetric for reads and writes, and there is usually significant contention between the read/write commands going between the chips and the data moving between the chips. The overall ratio of local to remote bandwidth also varies significantly across processor generations. For some processors (e.g., Xeon E5 v3 and probably v4), the interconnect is relatively fast, so jobs with poor locality can often be run with all of the memory interleaved between the two sockets. Local bandwidths have increased significantly since then, with more recent processors generally strongly favoring local access.

    Example from the Intel Xeon Platinum 8160 (2 UPI links between chips):

    • Local Bandwidth for Reads (each socket) ~112 GB/s
    • Remote Bandwidth for Reads (one-direction at a time) ~34 GB/s
    • Local bandwidth scales perfectly in two-socket systems, and remote bandwidth also scales very well when using both sockets (each socket reading data from the other socket).
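    A hedged sketch of how the "reading from the other socket" case can be set up with libnuma (the node numbers, size, and single-threaded loop are illustrative; a real measurement would use many threads per socket):

```c
#include <stdio.h>
#include <numa.h>          /* libnuma; link with -lnuma */

int main(void)
{
    if (numa_available() < 0)
        return 1;

    size_t n = 1UL << 27;   /* 2^27 doubles, illustrative */

    /* Bind the buffer's pages to node 1 ... */
    double *remote = numa_alloc_onnode(n * sizeof *remote, 1);
    if (!remote)
        return 1;

    /* ... but run this thread on node 0, so every access to the
     * buffer crosses the socket interconnect. */
    numa_run_on_node(0);

    /* Pages fault in on node 1 (because of the binding policy),
     * even though the writing thread is on node 0. */
    for (size_t i = 0; i < n; i++)
        remote[i] = 1.0;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += remote[i];           /* remote reads */

    printf("%f\n", sum);
    numa_free(remote, n * sizeof *remote);
    return 0;
}
```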

    It gets more complicated with combined read and write traffic between sockets, because the read traffic from node 0 to node 1 competes with the write traffic from node 1 to node 0, etc.

    • Local Bandwidth for 1R:1W (each socket) ~101 GB/s (reduced due to read/write scheduling overhead)
    • Remote Bandwidth for 1R:1W (one socket running at a time) ~50 GB/s -- more bandwidth is available because both directions are being used, but this also means that if both sockets are doing the same thing, there will be conflicts. I see less than 60 GB/s aggregate when both sockets are running 1R:1W remote at the same time.

    Of course, different ratios of local to remote accesses will change the scaling. Timing can also be an issue -- if the threads do their local accesses at the same time and then their remote accesses at the same time, there will be more contention during the remote-access portion (compared to a case in which the threads do their remote accesses at different times).