performance x86-64 cpu-cache memory-bandwidth

Random memory write is slower than random memory read?

I'm trying to measure the memory access time of sequential and random reads and writes. Here's the code:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define PRINT_EXCECUTION_TIME(msg, code)                                       \
  do {                                                                         \
    struct timeval t1, t2;                                                     \
    double elapsed;                                                            \
    gettimeofday(&t1, NULL);                                                   \
    do {                                                                       \
      code;                                                                    \
    } while (0);                                                               \
    gettimeofday(&t2, NULL);                                                   \
    elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0;                                \
    elapsed += (t2.tv_usec - t1.tv_usec) / 1000.0;                             \
    printf(msg " time: %f ms\n", elapsed);                                     \
  } while (0)

const int RUNS = 20;
const int N = (1 << 27) - 1;
int *data;

int seqR() {
  register int res = 0;
  register int *data_p = data;
  register int pos = 0;

  for (register int j = 0; j < RUNS; j++) {
    for (register int i = 0; i < N; i++) {
      pos = (pos + 1) & N;
      res = data_p[pos];
    }
  }

  return res;
}

int seqW() {
  register int res = 0;
  register int *data_p = data;
  register int pos = 0;

  for (register int j = 0; j < RUNS; j++) {
    for (register int i = 0; i < N; i++) {
      pos = (pos + 1) & N;
      data_p[pos] = res;
    }
  }

  return res;
}

int rndR() {
  register int res = 0;
  register int *data_p = data;
  register int pos = 0;

  for (register int j = 0; j < RUNS; j++) {
    for (register int i = 0; i < N; i++) {
      pos = (pos + i) & N;
      res = data_p[pos];
    }
  }

  return res;
}

int rndW() {
  register int res = 0;
  register int *data_p = data;
  register int pos = 0;

  for (register int j = 0; j < RUNS; j++) {
    for (register int i = 0; i < N; i++) {
      pos = (pos + i) & N;
      data_p[pos] = res;
    }
  }

  return res;
}

int main() {
  data = (int *)malloc(sizeof(int) * (N + 1));
  assert(data);

  for (int i = 0; i < N; i++) {
    data[i] = i;
  }

  for (int i = 0; i < 10; i++) {
    PRINT_EXCECUTION_TIME("seqR", seqR());
    PRINT_EXCECUTION_TIME("seqW", seqW());
    PRINT_EXCECUTION_TIME("rndR", rndR());
    PRINT_EXCECUTION_TIME("rndW", rndW());
  }

  return 0;
}

I used gcc 6.5.0 with -O0 to prevent optimization and got results like this:

seqR time: 2538.010000 ms
seqW time: 2394.991000 ms
rndR time: 40625.169000 ms
rndW time: 46184.652000 ms
seqR time: 2411.038000 ms
seqW time: 2309.115000 ms
rndR time: 41575.063000 ms
rndW time: 46206.275000 ms

It's easy to understand that sequential access is much faster than random access. However, it doesn't make sense to me that random write is slower than random read while sequential write is faster than sequential read. What could cause this?

In addition, am I safe to say memory bandwidth for seqR is (20 * ((1 << 27) - 1) * 4 * 1024 * 1024 * 1024)GB / (2.538)s = 4.12GB/s?

Solution

Sounds normal. All x86-64 CPUs (and most other modern CPUs) use write-back / write-allocate caches so a write costs a read before it can commit to cache, and an eventual write-back.
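As a rough illustration of that extra cost, here is a minimal sketch that counts DRAM traffic per cache miss under a simple write-back / write-allocate model. The 64-byte line size and the helper name dram_bytes_per_miss are assumptions for illustration, not something from the question. It counts bytes moved, not time, so it doesn't predict the measured rndR/rndW ratio.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size: 64 bytes is typical for x86-64, but it is an
 * assumption here, not something the benchmark measured. */
#define LINE_BYTES 64

/* Hypothetical helper: DRAM bytes moved per missing access in a simple
 * write-back / write-allocate model.
 *  - load miss:  one line read (assuming the evicted line is clean)
 *  - store miss: one line read to allocate the line in cache, plus one
 *                eventual write-back of the dirtied line */
size_t dram_bytes_per_miss(int is_store) {
  assert(is_store == 0 || is_store == 1);
  return is_store ? 2 * (size_t)LINE_BYTES : (size_t)LINE_BYTES;
}
```

In this model a random 4-byte store that misses moves twice as many bytes as a random 4-byte load that misses, even though only 4 bytes of the line are ever written.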

"with -O0 to prevent optimization"

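Since you declared all your locals register, this is one of the rare cases where compiling with -O0 didn't make your benchmark meaningless: at -O0, gcc keeps register variables in registers instead of storing and reloading them around every statement.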

You could instead have made the array accesses volatile and compiled with optimization enabled. That makes sure every one of those accesses happens, in order, but leaves it up to the optimizer how to make that happen.
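A minimal sketch of the volatile approach, assuming the same index pattern as seqR/seqW. The function names and the explicit mask parameter are hypothetical; the question uses the globals data and N instead.

```c
#include <assert.h>
#include <stddef.h>

/* Same access pattern as seqR, but through a pointer-to-volatile, so the
 * compiler must perform every load even with optimization enabled.
 * mask must be one less than a power of two (like N in the question). */
int seq_read_volatile(const volatile int *data, size_t mask, int runs) {
  assert((mask & (mask + 1)) == 0); /* mask == 2^k - 1 */
  int res = 0;
  size_t pos = 0;
  for (int j = 0; j < runs; j++) {
    for (size_t i = 0; i < mask; i++) {
      pos = (pos + 1) & mask;
      res = data[pos]; /* volatile load: cannot be optimized away */
    }
  }
  return res;
}

/* Store variant, matching seqW: every store must actually be performed. */
void seq_write_volatile(volatile int *data, size_t mask, int runs, int value) {
  assert((mask & (mask + 1)) == 0);
  size_t pos = 0;
  for (int j = 0; j < runs; j++) {
    for (size_t i = 0; i < mask; i++) {
      pos = (pos + 1) & mask;
      data[pos] = value; /* volatile store: cannot be merged or dropped */
    }
  }
}
```

With this shape the loop counters and index arithmetic can be optimized freely, while the loads and stores being measured are still all performed.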

"Am I safe to say memory bandwidth for seqR is (20 * ((1 << 27) - 1) * 4 * 1024 * 1024 * 1024)GB / (2.538)s = 4.12GB/s?"

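No. Your numerator multiplies by 1024 * 1024 * 1024 (2^30), where a conversion from bytes to GiB would divide by it (or by 10^9 for GB). You wrote the expression wrong, but the number you ended up with is close to the right one anyway.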

The correct calculation is RUNS * N * sizeof(int) / time bytes per second; divide that by 10^9 for GB/s, or by 2^30 for base-2 GiB/s. (Memory sizes are usually in GiB, but you can take your pick with bandwidth; DRAM clock speeds are normally things like 1600 MHz, so base-10 GB = 10^9 bytes is certainly normal for theoretical max bandwidths in GB/s.)

So 4.23 GB/s in base-10 GB.
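The arithmetic can be written out as a small sketch. The helper names below are hypothetical; the inputs in the usage note are the question's RUNS, N, sizeof(int) (assumed to be 4) and the first seqR time.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per second moved by `runs` passes over `n` elements of
 * `elem_size` bytes each, taking `seconds` in total. */
double bytes_per_second(long long runs, long long n, size_t elem_size,
                        double seconds) {
  assert(seconds > 0.0);
  return (double)runs * (double)n * (double)elem_size / seconds;
}

/* Base-10 gigabytes: 1 GB = 10^9 bytes. */
double to_gb(double bytes) { return bytes / 1e9; }

/* Base-2 gibibytes: 1 GiB = 2^30 bytes. */
double to_gib(double bytes) { return bytes / 1073741824.0; }
```

For example, to_gb(bytes_per_second(20, (1 << 27) - 1, 4, 2.538)) comes out at about 4.23, and the same value passed to to_gib gives about 3.94.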

Yes, you initialized the array first, so none of the timed runs trigger page faults, but I might still have used the numbers from the 2nd run, after the CPU has warmed up to max turbo, if it hadn't already.
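One way to take warm-up into account is sketched below: call the function once untimed, then time several repetitions and keep the fastest. The helper names are hypothetical, and it uses clock_gettime(CLOCK_MONOTONIC) rather than the question's gettimeofday, which is a swapped-in choice, not part of the original code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Milliseconds elapsed between two timespec readings. */
double elapsed_ms(struct timespec t1, struct timespec t2) {
  return (t2.tv_sec - t1.tv_sec) * 1000.0 +
         (t2.tv_nsec - t1.tv_nsec) / 1e6;
}

/* Hypothetical helper: run fn once untimed as a warm-up, then time `reps`
 * further calls and return the fastest one in milliseconds. */
double best_time_ms(int (*fn)(void), int reps) {
  assert(reps > 0);
  volatile int sink = fn(); /* warm-up call; result kept so it isn't dropped */
  double best = -1.0;
  for (int r = 0; r < reps; r++) {
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = fn();
    clock_gettime(CLOCK_MONOTONIC, &t2);
    double ms = elapsed_ms(t1, t2);
    if (best < 0.0 || ms < best) best = ms;
  }
  (void)sink;
  return best;
}
```

With the question's functions this could be called as best_time_ms(seqR, 5). Taking the minimum is one common choice; a median would be less sensitive to a single unusually fast run.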

But keep in mind this is un-optimized code. That's how fast your un-optimized code ran, and doesn't tell you much about how fast your memory is. It's probably CPU bound, not memory.

That's especially likely with the redundant & N in seqR/seqW, which is only there to match the CPU work of the rndR/W functions. HW prefetching is probably able to keep up with 4 GB/s, but the loop still isn't even reading 1 int per clock cycle.
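The "less than 1 int per clock" estimate can be checked with a short sketch. The core clock frequency is not given in the question, so the 3 GHz figure in the usage note is purely an assumed value, and the helper name is hypothetical.

```c
#include <assert.h>

/* Elements read per core clock cycle, given a bandwidth in bytes per
 * second, the element size in bytes, and the core clock in Hz. */
double elems_per_cycle(double bytes_per_sec, double elem_bytes,
                       double clock_hz) {
  assert(elem_bytes > 0.0 && clock_hz > 0.0);
  return bytes_per_sec / elem_bytes / clock_hz;
}
```

For 4.23 GB/s of 4-byte ints on an assumed 3 GHz core, elems_per_cycle(4.23e9, 4.0, 3.0e9) is roughly 0.35 ints per cycle, which is consistent with the loop being limited by the un-optimized code rather than by memory.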