L3 cpu cache java benchmark shows strange results

After reading this article I decided to check it on my laptop. The idea is create array with size [1..40] Mb, and then iterate over it 1024 times (e.g for array with size 1 step will be 1024, for array with 2 mb size step will be 2048 etc.). My code is:

public class L3CacheBenchmark {

    @State(Scope.Benchmark)
    public static class P {

        @Param({
                       "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
                       "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
                       "21", "22", "23", "24", "25", "26", "27", "28", "29", "30",
                       "31", "32", "33", "34", "35", "36", "37", "38", "39", "40",
               })
        public int size;
    }

    @State(Scope.Thread)
    public static class ThreadData {

        byte[] array;
        int    len;

        @Setup
        public void setup(P p) {
            array = new byte[p.size * 1024 * 1024];
            len = array.length;
        }
    }


    @Benchmark
    public byte[] testMethod(ThreadData data) {
        int step = (data.len / 1024) - 1;
        for (int k = 0; k < data.len; k += step) {
            data.array[k] = 1;
        }
        return data.array;
    }

}

And results:

Benchmark                    (size)   Mode  Cnt       Score       Error  Units
L3CacheBenchmark.testMethod       1  thrpt  100  310521,031 ±  1124,590  ops/s
L3CacheBenchmark.testMethod       2  thrpt  100  331853,495 ±  1124,547  ops/s
L3CacheBenchmark.testMethod       3  thrpt  100  311499,659 ±   745,414  ops/s
L3CacheBenchmark.testMethod       4  thrpt  100  290270,382 ±  8501,690  ops/s
L3CacheBenchmark.testMethod       5  thrpt  100  212929,246 ± 14847,931  ops/s
L3CacheBenchmark.testMethod       6  thrpt  100  315968,138 ±  4454,210  ops/s
L3CacheBenchmark.testMethod       7  thrpt  100  209679,904 ± 26050,365  ops/s
L3CacheBenchmark.testMethod       8  thrpt  100   60409,187 ±   212,548  ops/s
L3CacheBenchmark.testMethod       9  thrpt  100  221290,756 ± 28970,586  ops/s
L3CacheBenchmark.testMethod      10  thrpt  100  322865,687 ±  1545,967  ops/s
L3CacheBenchmark.testMethod      11  thrpt  100  263153,747 ± 18497,624  ops/s
L3CacheBenchmark.testMethod      12  thrpt  100  298683,205 ±  1277,032  ops/s
L3CacheBenchmark.testMethod      13  thrpt  100  180984,220 ± 26611,649  ops/s
L3CacheBenchmark.testMethod      14  thrpt  100  324815,938 ±  1657,303  ops/s
L3CacheBenchmark.testMethod      15  thrpt  100  264965,412 ±  9335,923  ops/s
L3CacheBenchmark.testMethod      16  thrpt  100   58830,825 ±   291,412  ops/s
L3CacheBenchmark.testMethod      17  thrpt  100  255576,829 ±  7083,025  ops/s
L3CacheBenchmark.testMethod      18  thrpt  100  324174,133 ±  2247,157  ops/s
L3CacheBenchmark.testMethod      19  thrpt  100  212969,202 ± 18204,625  ops/s
L3CacheBenchmark.testMethod      20  thrpt  100  295246,470 ±  1224,817  ops/s
L3CacheBenchmark.testMethod      21  thrpt  100  251762,642 ± 23405,100  ops/s
L3CacheBenchmark.testMethod      22  thrpt  100  323196,428 ±  2245,465  ops/s
L3CacheBenchmark.testMethod      23  thrpt  100  254588,338 ± 23845,090  ops/s
L3CacheBenchmark.testMethod      24  thrpt  100   53373,580 ±   252,183  ops/s
L3CacheBenchmark.testMethod      25  thrpt  100  213220,459 ± 20440,716  ops/s
L3CacheBenchmark.testMethod      26  thrpt  100  322625,597 ±  2076,341  ops/s
L3CacheBenchmark.testMethod      27  thrpt  100  293643,720 ±  5260,010  ops/s
L3CacheBenchmark.testMethod      28  thrpt  100  297432,240 ±  1186,920  ops/s
L3CacheBenchmark.testMethod      29  thrpt  100  169277,701 ± 25040,239  ops/s
L3CacheBenchmark.testMethod      30  thrpt  100  324230,899 ±  1579,103  ops/s
L3CacheBenchmark.testMethod      31  thrpt  100  193981,979 ± 12478,424  ops/s
L3CacheBenchmark.testMethod      32  thrpt  100   53761,030 ±   259,888  ops/s
L3CacheBenchmark.testMethod      33  thrpt  100  213585,493 ± 23543,671  ops/s
L3CacheBenchmark.testMethod      34  thrpt  100  325214,062 ±  1758,479  ops/s
L3CacheBenchmark.testMethod      35  thrpt  100  306652,634 ±  2237,818  ops/s
L3CacheBenchmark.testMethod      36  thrpt  100  297992,930 ±  1019,248  ops/s
L3CacheBenchmark.testMethod      37  thrpt  100  181671,812 ± 21984,441  ops/s
L3CacheBenchmark.testMethod      38  thrpt  100  321929,616 ±  1798,747  ops/s
L3CacheBenchmark.testMethod      39  thrpt  100  251587,385 ± 12292,670  ops/s
L3CacheBenchmark.testMethod      40  thrpt  100   49777,196 ±   227,620  ops/s

As you can see, throughput is different, and the most striking difference is for arrays with size multiple of 8: speed degradation is almost 4 times. Also, for example, speed for array with size 37 Mb is almost twice less than for 38 Mb. I did not find any logical explanation of my findings.

P.S. Cpu i7 4700mq 6 Mb cache: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-4700MQ%20Mobile%20processor.html

What causes this behaviour?

Solution

You are observing the effect of cache associativity.

Your CPU has 256 KB 8-way set associative L2 cache per core. It can store up to 256 KB / 64 cache lines where no more than 8 lines can have the same index bits.

Your benchmark loop writes to 1025 different addresses. However, depending on the stride, these addresses may fall into the small number of sets, resulting in collisions and evictions from the cache. That's exactly what happens in your case when stride (step) = 8191, 16383, 24575, etc.

To verify this theory, rerun JMH benchmark with -prof perfnorm option.
Here are the stats for size = 8 and size = 9:

L3CacheBenchmark.testMethod:CPI                      8  thrpt      1.173  #/op
L3CacheBenchmark.testMethod:L1-dcache-load-misses    8  thrpt   1048.088  #/op
L3CacheBenchmark.testMethod:L1-dcache-loads          8  thrpt   1073.767  #/op
L3CacheBenchmark.testMethod:L1-dcache-store-misses   8  thrpt   1049.491  #/op
L3CacheBenchmark.testMethod:L1-dcache-stores         8  thrpt   1060.069  #/op
L3CacheBenchmark.testMethod:L1-icache-load-misses    8  thrpt      1.209  #/op
L3CacheBenchmark.testMethod:LLC-load-misses          8  thrpt      0.082  #/op
L3CacheBenchmark.testMethod:LLC-loads                8  thrpt      1.399  #/op
L3CacheBenchmark.testMethod:LLC-store-misses         8  thrpt      0.077  #/op
L3CacheBenchmark.testMethod:LLC-stores               8  thrpt   1035.877  #/op
L3CacheBenchmark.testMethod:branch-misses            8  thrpt      1.234  #/op
L3CacheBenchmark.testMethod:branches                 8  thrpt   2096.674  #/op
L3CacheBenchmark.testMethod:cycles                   8  thrpt  13520.964  #/op
L3CacheBenchmark.testMethod:dTLB-load-misses         8  thrpt      0.057  #/op
L3CacheBenchmark.testMethod:dTLB-loads               8  thrpt   1086.355  #/op
L3CacheBenchmark.testMethod:dTLB-store-misses        8  thrpt      0.020  #/op
L3CacheBenchmark.testMethod:dTLB-stores              8  thrpt   1068.579  #/op
L3CacheBenchmark.testMethod:iTLB-load-misses         8  thrpt      0.044  #/op
L3CacheBenchmark.testMethod:iTLB-loads               8  thrpt      0.018  #/op
L3CacheBenchmark.testMethod:instructions             8  thrpt  11530.742  #/op
L3CacheBenchmark.testMethod:stalled-cycles-backend   8  thrpt   8315.437  #/op
L3CacheBenchmark.testMethod:stalled-cycles-frontend  8  thrpt  10359.447  #/op

L3CacheBenchmark.testMethod:CPI                      9  thrpt      0.871  #/op
L3CacheBenchmark.testMethod:L1-dcache-load-misses    9  thrpt   1055.973  #/op
L3CacheBenchmark.testMethod:L1-dcache-loads          9  thrpt   1068.958  #/op
L3CacheBenchmark.testMethod:L1-dcache-store-misses   9  thrpt   1045.480  #/op
L3CacheBenchmark.testMethod:L1-dcache-stores         9  thrpt   1057.328  #/op
L3CacheBenchmark.testMethod:L1-icache-load-misses    9  thrpt      1.108  #/op
L3CacheBenchmark.testMethod:LLC-load-misses          9  thrpt      0.174  #/op
L3CacheBenchmark.testMethod:LLC-loads                9  thrpt      0.304  #/op
L3CacheBenchmark.testMethod:LLC-store-misses         9  thrpt      0.045  #/op
L3CacheBenchmark.testMethod:LLC-stores               9  thrpt      0.350  #/op
L3CacheBenchmark.testMethod:branch-misses            9  thrpt      1.072  #/op
L3CacheBenchmark.testMethod:branches                 9  thrpt   2099.846  #/op
L3CacheBenchmark.testMethod:cycles                   9  thrpt  10041.724  #/op
L3CacheBenchmark.testMethod:dTLB-load-misses         9  thrpt      0.086  #/op
L3CacheBenchmark.testMethod:dTLB-loads               9  thrpt   1073.633  #/op
L3CacheBenchmark.testMethod:dTLB-store-misses        9  thrpt      0.045  #/op
L3CacheBenchmark.testMethod:dTLB-stores              9  thrpt   1054.587  #/op
L3CacheBenchmark.testMethod:iTLB-load-misses         9  thrpt      0.044  #/op
L3CacheBenchmark.testMethod:iTLB-loads               9  thrpt      0.037  #/op
L3CacheBenchmark.testMethod:instructions             9  thrpt  11529.996  #/op
L3CacheBenchmark.testMethod:stalled-cycles-backend   9  thrpt   3439.278  #/op
L3CacheBenchmark.testMethod:stalled-cycles-frontend  9  thrpt   6888.714  #/op

The most notable is the difference in LLC-stores: 1035 for size = 8 and almost none for size = 9. This means that the data being stored does not fit into L2 cache and goes to L3.

BTW, your benchmark cannot measure the effect of L3 cache, since it touches only small amount of data (about 64 KB). In order to make a fair test, you'll need to read and write the whole range of the allocated array.