Tags: memory, benchmarking, bandwidth, memory-bandwidth

Why does the sysbench memory read benchmark show higher bandwidth than the theoretical limit?


I get confusing results from memory benchmarks when comparing them to the theoretical limit of my system memory, and would like to confirm the reason behind it. Namely, the write benchmarks do match the theoretical bandwidth limit, but the read benchmark runs about twice as fast as the limit. Is that something to expect in practice?

Other Stack Exchange sites suggest a simple explanation for why reads are faster than writes: writing to memory means charging and discharging the memory cells, while reading is a passive operation and hence should be faster. Is that really so? I am not sure, because it is never mentioned in the calculations of the theoretical bandwidth limit. I thought that memory transfers were not limited by what happens on the memory chip, but only by the memory-CPU bus.

I have an AMD PC with 1 CPU socket, 4 DDR4 sticks and 2 memory channels, running at 2133 MHz with an 8-byte (64-bit) bus width according to sudo lshw -C memory. I.e. it is exactly like in this SO answer. Following that answer, and this Intel post, the theoretical limit for the bandwidth is:

  8 (bytes per transfer)
* 2 (memory channels)
* 2.133 GT/s (transfer rate of the RAM module)
= 34.128 GB/s

Here I am not sure whether I need to multiply the 2133 MHz from lshw -C memory by 2 to account for double data rate or not. This memory is CMK32GX4M1D3000C16 from Corsair, and I cannot find a clear spec from them. The Corsair website quotes the number as "SPD Speed". Some other places quote it as "Data transfer rate: 2133 MHz". So I assume it is the data transfer rate, not the bus clock, and I do not need to multiply by 2.

When I run a simple custom program with a loop of memcpy between static volatile input and output arrays of about 512 MB each, copying fixed chunks of 8 bytes per memcpy call, I get something very close to 34 GB/s. The program looks like this:

alignas(64) volatile uint8_t data_in  [...];
alignas(64) volatile uint8_t data_out [...];

for (unsigned n_rep = 0; n_rep < max_repeats; n_rep++) {
  uint8_t* data_out_ptr = data_out;
  uint8_t* data_in_ptr = data_in;

  for (unsigned long long i_packet=0; ...) {
    memcpy(data_out_ptr, data_in_ptr, SIZE_TO_MEMCPY*sizeof(uint8_t));
    data_out_ptr += ...
    data_in_ptr += ...
  }
}

So this program reads from and writes to memory, not the CPU cache, and it does show something very close to 34 GB/s of bandwidth, as expected.

Then I run sysbench:

sysbench memory --memory-block-size=1M --memory-total-size=20G --memory-oper=read run

And it shows about 60 GB/s. When I run it with --memory-oper=write it sits around 27 GB/s. These speeds remain the same down to --memory-block-size=256KB, and then unexpectedly get slower at smaller block sizes.

My understanding of this is:

  1. sysbench's memory benchmark is not reliable with smaller blocks. Here is a nice blog post about it. That should mean the 27 GB/s limit of the --memory-oper=write benchmark comes from sysbench itself. If it could run on smaller blocks correctly, I assume the write benchmark would reach 34 GB/s like my simple memcpy program.
  2. memory reads are indeed faster than writes for some reason. The sysbench memory read benchmark is correct here and shows the true numbers for data transfers from memory to the CPU. My memcpy program does not show this because it also writes to memory, and therefore sits at the limit of memory writes.

Is that right?

Does it mean that the typical calculations of max memory bandwidth actually show the max write bandwidth, and the reads alone can be as much as twice faster?

I am just a bit surprised by it. I thought that memory bandwidth is not limited by what happens on the memory chip, and that the max bandwidth is limited by the memory bus alone.

Strangely, if I define volatile uint8_t static_byte and replace the memcpy call with static_byte = *data_in_ptr; in the memcpy program, it runs at about 22 GB/s instead of more than 34 as in the sysbench memory read benchmark. I am not sure why it runs slower. Is it because it has to write into the same volatile byte, so the memory access commands cannot go out on the bus in parallel?
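A read-only variant of that loop, with the volatile sink described above, could be sketched like this (reduced to a function over an arbitrary buffer; the volatile sink is there so the compiler cannot discard the loads):

```c
#include <stddef.h>
#include <stdint.h>

static volatile uint8_t static_byte; // volatile sink: keeps the loads alive

// Reads every byte of data_in through the volatile sink and returns the
// last byte read, so the result is observable.
uint8_t read_loop(const uint8_t *data_in, size_t len) {
    for (size_t i = 0; i < len; i++) {
        static_byte = data_in[i]; // every load funnels through one volatile byte
    }
    return static_byte;
}
```

Note that every iteration also performs a store to the same volatile byte; that store traffic stays in cache, but the volatile semantics constrain how aggressively the compiler can batch the loads.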


Solution

  • I measured the memcpy benchmark performance over the weekend, and it seems to clarify everything. The conclusions from those measurements:

    • yes, the max memory bandwidth is 34.128 GB/s (i.e. 2.133 GT/s is indeed the transfer rate of the Corsair CMK32GX4M1D3000C16, not the clock; no need for the x2 DDR factor)

    • that bandwidth is indeed shared between reads and writes, i.e. the maximum throughput of the memcpy benchmark is somewhere around 17 GB/s

    • I get about 10 GB/s (10000 MB/s) in the memcpy benchmark running 1 process on 1 core, when the benchmark is compiled into efficient assembly

      • 10 GB/s is the throughput of the memcpy benchmark, i.e. it counts the whole move of the bytes from memory and back to it. I don't double that number simply because that's what I have in the log. Sorry about that.
      • the 34 GB/s measurement in the question was a mistake: the compiler optimized the outer "repeat" loop in the memcpy code and turned it into an inner loop. So, in the assembly, the CPU copies the same 16 bytes 1000 times (or whatever the repeat factor is), and it hits the cache all the time.
    • and it reaches 12.5 GB/s when running 2 memcpy processes on 2 cores (each process copies at 6.2 GB/s), and about 11 GB/s for 4 or 6 memcpy processes

    • from this I conclude:

      • the realistic memory bandwidth maximum is around 12.5 GB/s, maybe a bit higher
      • 1 memcpy benchmark process does not completely saturate the memory bus, because something else gets saturated before that. Probably the load-store (LS) unit cannot issue memory transactions quite fast enough, although at 10 GB/s it is close to saturating the memory bus.
    • I did not look into the sysbench code, but it clearly repeats memory accesses inside those "blocks", i.e. you test the cache that way. What I still do not understand is its multi-threaded numbers. When I set the "block" to a size larger than the cache (128 MB, when my cache is 8 MB), it does show reasonable numbers for 1 thread, but with 4 threads it shows something like 60 GB/s again. It must be hitting the cache somehow.

    If the LS unit limitation plays a role, I wonder what you'd get if you constructed the benchmark to do more loads than stores, or vice versa. My CPU is of the Zen 2 architecture, and it might have more resources for loads than for stores? I have not read up on that well. (The AMD manual would be better.)

    The LS unit contains a 44-entry load queue (LDQ) which receives load operations from dispatch through either of the two load AGUs... A 48-entry store queue (STQ), up from 44 entries in Zen, receives store operations from dispatch, a linear address computed by any of the three AGUs...

    Details & measurements behind these conclusions

    The benchmark moved the same overall number of bytes, but with memcpy calls of different sizes: 16 or 32 bytes. And I do the memcpy calls within "packets" that can be either the same size as the memcpy or larger. I.e. the memcpy calls either copy all bytes in the memory array back-to-back, or make a sparse but simple, well-defined pattern. The code looks like this:

    #define SIZE_TO_MEMCPY 16 // 32
    #define PACKET_SIZE    16 // 32
    
    const static long long unsigned n_packets = 2*16*1024*1024; // 1*16*1024*1024
    alignas(CACHE_LINE_SIZE) volatile uint8_t data_in  [n_packets][PACKET_SIZE];
    alignas(CACHE_LINE_SIZE) volatile uint8_t data_out [n_packets][PACKET_SIZE];
    
    static long long unsigned n_repeat    = 1000;
    
    int main() {
      // warmup data_in and data_out
    
      long long unsigned n_bytes_copied = 0;
      time_start();
      for (unsigned long long r = 0; r < n_repeat; r++) {
        for (unsigned long long i = 0; i < n_packets; i++) {
          memcpy(data_out[i], data_in[i], SIZE_TO_MEMCPY*sizeof(uint8_t));
          n_bytes_copied += SIZE_TO_MEMCPY;
        }
      }
      time_end();
    }
    

    And there are 3 cases:

    // case 1, sparse 16 bytes
    #define SIZE_TO_MEMCPY 16
    #define PACKET_SIZE    32
    const static long long unsigned n_packets = 2*16*1024*1024;
    
    // case 2, dense 16 bytes
    #define SIZE_TO_MEMCPY 16
    #define PACKET_SIZE    16
    const static long long unsigned n_packets = 2*16*1024*1024;
    
    // case 3, dense 32 bytes
    #define SIZE_TO_MEMCPY 32
    #define PACKET_SIZE    32
    const static long long unsigned n_packets = 1*16*1024*1024;
    

    In the first case the compiler inverted the order of the loops, and the benchmark showed 34 GB/s because it was hitting the cache. Case 1 looked like this in assembly:

      │180:┌─→movdqa   (%rdx),%xmm0
           │    │{
      0.48 │    │  mov      $0x3e8,%eax
           │    │  nop
      0.30 │190:│  movaps   %xmm0,(%rbx)
           │    │
     97.19 │    │  sub      $0x1,%eax
      0.07 │    │↑ jne      190
           │    │printf("\nrun memcpy\n");
           │    │  add      $0x10,%rdx
      1.01 │    │  add      $0x20,%rbx
           │    ├──cmp      %rcx,%rdx
      0.20 │    └──jne      180
    

    I.e. it does only stores in the jne 190 inner loop. And you can see it in the PMC statistics: there were 34 billion store events and 0.8 billion loads. For my AMD CPU, Linux perf has the following events for store and load instructions: ls_dispatch.ld_dispatch for loads and ls_dispatch.store_dispatch for stores.
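    One way to keep the compiler from inverting the repeat loop like this is a compiler barrier: an empty asm statement with a "memory" clobber inside the outer loop. A GCC/Clang-specific sketch (the function name and sizes are illustrative, not from the original benchmark):

```c
#include <stdint.h>
#include <string.h>

#define N 1024

static uint8_t src[N], dst[N];

// The empty asm with a "memory" clobber tells the compiler that any memory
// may have changed, so it cannot hoist, merge, or reorder the copies
// across repetitions of the outer loop.
void repeated_copy(unsigned repeats) {
    for (unsigned r = 0; r < repeats; r++) {
        memcpy(dst, src, N);
        __asm__ volatile("" ::: "memory"); // compiler barrier
    }
}
```

    With the barrier in place, each repetition has to re-read the source array, so the measurement reflects real memory traffic rather than a cache-resident inner loop.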

    Case 2, with dense 16-byte packets, got compiled into the most efficient code. It is the case that produced the 10 GB/s throughput, almost saturating the memory bus. The compiler unrolled the loop and used the YMM registers to move 32 bytes per instruction:

           │500:┌─→mov $0x20,%ecx
      0.01 │505:│ prefetcht0 0x80(%rsi)
      0.09 │ │ prefetcht0 0xc0(%rsi)
      0.00 │ │ prefetcht0 0x1080(%rsi)
      0.06 │ │ prefetcht0 0x10c0(%rsi)
      0.00 │ │ prefetcht0 0x2080(%rsi)
      0.22 │ │ prefetcht0 0x20c0(%rsi)
      0.01 │ │ prefetcht0 0x3080(%rsi)
      3.49 │ │ prefetcht0 0x30c0(%rsi)
      0.00 │ │ vmovdqu (%rsi),%ymm0
      1.25 │ │ vmovdqu 0x20(%rsi),%ymm1
      0.01 │ │ vmovdqu 0x40(%rsi),%ymm2
      6.19 │ │ vmovdqu 0x60(%rsi),%ymm3
      0.03 │ │ vmovdqu 0x1000(%rsi),%ymm4
      9.59 │ │ vmovdqu 0x1020(%rsi),%ymm5
      0.05 │ │ vmovdqu 0x1040(%rsi),%ymm6
     13.04 │ │ vmovdqu 0x1060(%rsi),%ymm7
      0.08 │ │ vmovdqu 0x2000(%rsi),%ymm8
      6.49 │ │ vmovdqu 0x2020(%rsi),%ymm9
      0.06 │ │ vmovdqu 0x2040(%rsi),%ymm10
     11.43 │ │ vmovdqu 0x2060(%rsi),%ymm11
      0.08 │ │ vmovdqu 0x3000(%rsi),%ymm12
     15.93 │ │ vmovdqu 0x3020(%rsi),%ymm13
      0.10 │ │ vmovdqu 0x3040(%rsi),%ymm14
     15.85 │ │ vmovdqu 0x3060(%rsi),%ymm15
      0.17 │ │ sub $0xffffffffffffff80,%rsi
      0.03 │ │ vmovntdq %ymm0,(%rdi)
      0.93 │ │ vmovntdq %ymm1,0x20(%rdi)
      0.14 │ │ vmovntdq %ymm2,0x40(%rdi)
      0.73 │ │ vmovntdq %ymm3,0x60(%rdi)
      0.16 │ │ vmovntdq %ymm4,0x1000(%rdi)
      1.48 │ │ vmovntdq %ymm5,0x1020(%rdi)
      0.17 │ │ vmovntdq %ymm6,0x1040(%rdi)
      1.81 │ │ vmovntdq %ymm7,0x1060(%rdi)
      0.22 │ │ vmovntdq %ymm8,0x2000(%rdi)
      1.56 │ │ vmovntdq %ymm9,0x2020(%rdi)
      0.20 │ │ vmovntdq %ymm10,0x2040(%rdi)
      1.88 │ │ vmovntdq %ymm11,0x2060(%rdi)
      0.24 │ │ vmovntdq %ymm12,0x3000(%rdi)
      2.90 │ │ vmovntdq %ymm13,0x3020(%rdi)
      0.19 │ │ vmovntdq %ymm14,0x3040(%rdi)
      2.79 │ │ vmovntdq %ymm15,0x3060(%rdi)
      0.25 │ │ sub $0xffffffffffffff80,%rdi
           │ │ dec %ecx
      0.06 │ │↑ jne 505
      0.00 │ │ add $0x3000,%rdi
      0.00 │ │ add $0x3000,%rsi
      0.00 │ │ dec %r10
      0.01 │ └──jne 500
    

    Case 3, with 32-byte packets, got compiled into less efficient code, using XMM registers in a loop. It reached about 5.6 GB/s of throughput:

      │170:┌─→mov      %rbx,%rdx
           │    │
           │    │  lea      data_in,%rax
           │    │  nop
           │180:│  movdqa   (%rax),%xmm2
      33.06 │    │  movdqa   0x10(%rax),%xmm3
       1.43 │    │  add      $0x20,%rax
       1.15 │    │  add      $0x20,%rdx
       1.11 │    │  movaps   %xmm2,-0x20(%rdx)
      60.51 │    │  movaps   %xmm3,-0x10(%rdx)
       2.25 │    │  cmp      %rcx,%rax
       0.45 │    │↑ jne      180
           │    │  sub      $0x1,%esi
           │    └──jne      170
    

    For cases 2 and 3, perf stat showed the following PMC counts:

    // case 2, YMM registers
    48,851,280,938      instructions                     #    0.22  insn per cycle              (27.27%)
    26,038,321,059      ls_dispatch.ld_dispatch          #  503.669 M/sec                       (27.27%)
    17,448,793,278      ls_dispatch.store_dispatch       #  337.518 M/sec                       (27.27%)
    
    
    // case 3, a loop with XMM registers
    137,021,531,363      instructions                     #    0.34  insn per cycle              (27.27%)
    34,314,004,546      ls_dispatch.ld_dispatch          #  357.610 M/sec                       (27.27%)
    34,248,739,749      ls_dispatch.store_dispatch       #  356.930 M/sec                       (27.27%)
    

    These numbers make total sense: 34 billion loads and stores is exactly what you expect for 16-byte-wide memory accesses:

    2*16*1024*1024 * 16 [bytes] * 1000 [repeats] / 16 [bytes per move] =
    = 33.55 billion
    

    Case 2 with YMM registers makes 17 billion stores and about 26 billion loads, the extra loads coming from the prefetch instructions. The instruction counts also make sense.

    It was great to see this come together with just measuring a trivial benchmark. Unfortunately, I do not have proper support for PMU events in Linux and/or on this CPU. Somehow, with newer version of Linux kernel, perf even lost the "stalled backend" event, which used to be there. So, I cannot really dig into what saturates when 1 core runs this benchmark at 10GB/s throughput. Why the CPU runs at 0.22 IPC and does not issue more memory bus transactions to reach 12.5 GB/s? The LS unit or cache or something else in the CPU core must be at its limit.