Tags: x86-64, benchmarking, cpu-cache

Why is accessing every other cache line slower on x86, not matching Intel's documented cache behavior?


According to Intel's optimization manual, the L1 data cache is 32 KiB and 8-way associative with 64-byte lines. I have written the following micro-benchmark to test memory read performance.

I hypothesize that if we access only blocks that can fit in the 32 KiB cache, each memory access will be fast, but if we exceed that cache size, the accesses will suddenly be slower. When skip is 1, the benchmark accesses every line in order.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

void benchmark(int bs, int nb, int trials, int skip)
{
  printf("block size: %d, blocks: %d, skip: %d, trials: %d\n", bs, nb, skip, trials);
  printf("total data size: %d\n", nb*bs*skip);
  printf("accessed data size: %d\n", nb*bs);
  uint8_t volatile data[nb*bs*skip];   /* volatile so the reads are not optimized away */
  clock_t before = clock();
  for (int i = 0; i < trials; ++i) {
    for (int block = 0; block < nb; ++block) {
      data[block * bs * skip];         /* read one byte at the start of each accessed block */
    }
  }
  clock_t after = clock() - before;
  double ns_per_access = (double)after/CLOCKS_PER_SEC/nb/trials * 1000000000;
  printf("%f ns per memory access\n", ns_per_access);
}

Again with skip = 1, the results match my hypothesis:

~ ❯❯❯ ./bm -s 64 -b 128 -t 10000000 -k 1
block size: 64, blocks: 128, skip: 1, trials: 10000000
total data size: 8192
accessed data size: 8192
0.269054 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 256 -t 10000000 -k 1
block size: 64, blocks: 256, skip: 1, trials: 10000000
total data size: 16384
accessed data size: 16384
0.278184 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 1
block size: 64, blocks: 512, skip: 1, trials: 10000000
total data size: 32768
accessed data size: 32768
0.245591 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 1024 -t 10000000 -k 1
block size: 64, blocks: 1024, skip: 1, trials: 10000000
total data size: 65536
accessed data size: 65536
0.582870 ns per memory access

So far, so good: when everything fits in L1 cache, the inner loop runs about 4 times per nanosecond, or a bit more than once per clock cycle. When we make the data too big, it takes significantly longer. This is all consistent with my understanding of how the cache should work.

Now let's access every other block by setting skip to 2.

~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 2
block size: 64, blocks: 512, skip: 2, trials: 10000000
total data size: 65536
accessed data size: 32768
0.582181 ns per memory access

This violates my understanding! It would make sense for a direct-mapped cache, but since our cache is associative, I can't see why the lines should be conflicting with each other. Why is accessing every other block slower?

But if I set skip to 3, things are fast again. In fact, any odd value of skip is fast; any even value is slow. For example:

~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 7
block size: 64, blocks: 512, skip: 7, trials: 10000000
total data size: 229376
accessed data size: 32768
0.265338 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 12
block size: 64, blocks: 512, skip: 12, trials: 10000000
total data size: 393216
accessed data size: 32768
0.616013 ns per memory access

Why is this happening?

For completeness: I am on a mid-2015 MacBook Pro running macOS 10.13.4. My full CPU brand string is Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz. I am compiling with cc -O3 -o bm bm.c; the compiler is the one shipped with Xcode 9.4.1. I have omitted the main function; all it does is parse the command-line options and call benchmark.
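
For reference, here is a minimal sketch of what that omitted main might look like; the defaults and the mapping of the -s/-b/-t/-k option letters to the benchmark parameters are assumptions based on the transcripts above, not the original source.

#include <stdlib.h>
#include <unistd.h>

void benchmark(int bs, int nb, int trials, int skip);

int main(int argc, char **argv)
{
  /* Assumed defaults; the real main only parses options and calls benchmark. */
  int bs = 64, nb = 128, trials = 10000000, skip = 1;
  int opt;
  while ((opt = getopt(argc, argv, "s:b:t:k:")) != -1) {
    switch (opt) {
      case 's': bs = atoi(optarg); break;     /* block size in bytes */
      case 'b': nb = atoi(optarg); break;     /* number of blocks accessed */
      case 't': trials = atoi(optarg); break; /* number of passes over the blocks */
      case 'k': skip = atoi(optarg); break;   /* stride between accessed blocks */
    }
  }
  benchmark(bs, nb, trials, skip);
  return 0;
}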


Solution

  • The cache is not fully associative; it is set-associative, meaning that each address maps to a particular set, and the associativity only applies among lines that map to the same set.

    By making the step equal to 2, you keep half of the sets out of the game, so the fact that you effectively access only 32 KiB doesn't matter: only 16 KiB of cache are available to you (the even-numbered sets, say), so you still exceed the capacity and start thrashing (and fetching data from the next cache level).

    When the step is 3, the problem is gone, since after wrapping around you can use all the sets. The same goes for any step that is coprime with the number of sets; here that means any odd step, since the number of sets is a power of two (this is also why prime strides are sometimes used for address hashing).
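
    To make the arithmetic concrete: this L1D has 32768 / 64 / 8 = 64 sets, and the set index of an address is (address / 64) mod 64. The following standalone sketch (not part of the original benchmark; the 64-set figure is derived from the cache parameters quoted in the question) counts how many distinct sets the access pattern touches for a given skip.

    #include <stdio.h>
    #include <stdbool.h>

    /* Count how many of the 64 L1D sets are touched when reading nb blocks
       of bs bytes each with the given stride (skip). */
    int sets_touched(int bs, int nb, int skip)
    {
      const int line = 64, sets = 64;   /* 32 KiB / 64 B lines / 8 ways = 64 sets */
      bool used[64] = { false };
      int count = 0;
      for (int block = 0; block < nb; ++block) {
        int set = (block * bs * skip / line) % sets;
        if (!used[set]) { used[set] = true; ++count; }
      }
      return count;
    }

    int main(void)
    {
      for (int skip = 1; skip <= 12; ++skip) {
        int n = sets_touched(64, 512, skip);
        printf("skip %2d -> %2d sets -> %2d KiB of L1 usable\n",
               skip, n, n * 8 * 64 / 1024);
      }
      return 0;
    }

    With the question's parameters (512 blocks of 64 bytes), odd skips touch all 64 sets (the full 32 KiB), skip 2 touches 32 sets (16 KiB), and skip 12 touches only 16 sets (8 KiB), which lines up with the slow timings observed for even skips.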