After reading this article I decided to check it on my laptop. The idea is create array with size [1..40] Mb, and then iterate over it 1024 times (e.g for array with size 1 step will be 1024, for array with 2 mb size step will be 2048 etc.). My code is:
public class L3CacheBenchmark {
@State(Scope.Benchmark)
public static class P {
@Param({
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
"11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
"21", "22", "23", "24", "25", "26", "27", "28", "29", "30",
"31", "32", "33", "34", "35", "36", "37", "38", "39", "40",
})
public int size;
}
@State(Scope.Thread)
public static class ThreadData {
byte[] array;
int len;
@Setup
public void setup(P p) {
array = new byte[p.size * 1024 * 1024];
len = array.length;
}
}
@Benchmark
public byte[] testMethod(ThreadData data) {
int step = (data.len / 1024) - 1;
for (int k = 0; k < data.len; k += step) {
data.array[k] = 1;
}
return data.array;
}
}
And results:
Benchmark (size) Mode Cnt Score Error Units
L3CacheBenchmark.testMethod 1 thrpt 100 310521,031 ± 1124,590 ops/s
L3CacheBenchmark.testMethod 2 thrpt 100 331853,495 ± 1124,547 ops/s
L3CacheBenchmark.testMethod 3 thrpt 100 311499,659 ± 745,414 ops/s
L3CacheBenchmark.testMethod 4 thrpt 100 290270,382 ± 8501,690 ops/s
L3CacheBenchmark.testMethod 5 thrpt 100 212929,246 ± 14847,931 ops/s
L3CacheBenchmark.testMethod 6 thrpt 100 315968,138 ± 4454,210 ops/s
L3CacheBenchmark.testMethod 7 thrpt 100 209679,904 ± 26050,365 ops/s
L3CacheBenchmark.testMethod 8 thrpt 100 60409,187 ± 212,548 ops/s
L3CacheBenchmark.testMethod 9 thrpt 100 221290,756 ± 28970,586 ops/s
L3CacheBenchmark.testMethod 10 thrpt 100 322865,687 ± 1545,967 ops/s
L3CacheBenchmark.testMethod 11 thrpt 100 263153,747 ± 18497,624 ops/s
L3CacheBenchmark.testMethod 12 thrpt 100 298683,205 ± 1277,032 ops/s
L3CacheBenchmark.testMethod 13 thrpt 100 180984,220 ± 26611,649 ops/s
L3CacheBenchmark.testMethod 14 thrpt 100 324815,938 ± 1657,303 ops/s
L3CacheBenchmark.testMethod 15 thrpt 100 264965,412 ± 9335,923 ops/s
L3CacheBenchmark.testMethod 16 thrpt 100 58830,825 ± 291,412 ops/s
L3CacheBenchmark.testMethod 17 thrpt 100 255576,829 ± 7083,025 ops/s
L3CacheBenchmark.testMethod 18 thrpt 100 324174,133 ± 2247,157 ops/s
L3CacheBenchmark.testMethod 19 thrpt 100 212969,202 ± 18204,625 ops/s
L3CacheBenchmark.testMethod 20 thrpt 100 295246,470 ± 1224,817 ops/s
L3CacheBenchmark.testMethod 21 thrpt 100 251762,642 ± 23405,100 ops/s
L3CacheBenchmark.testMethod 22 thrpt 100 323196,428 ± 2245,465 ops/s
L3CacheBenchmark.testMethod 23 thrpt 100 254588,338 ± 23845,090 ops/s
L3CacheBenchmark.testMethod 24 thrpt 100 53373,580 ± 252,183 ops/s
L3CacheBenchmark.testMethod 25 thrpt 100 213220,459 ± 20440,716 ops/s
L3CacheBenchmark.testMethod 26 thrpt 100 322625,597 ± 2076,341 ops/s
L3CacheBenchmark.testMethod 27 thrpt 100 293643,720 ± 5260,010 ops/s
L3CacheBenchmark.testMethod 28 thrpt 100 297432,240 ± 1186,920 ops/s
L3CacheBenchmark.testMethod 29 thrpt 100 169277,701 ± 25040,239 ops/s
L3CacheBenchmark.testMethod 30 thrpt 100 324230,899 ± 1579,103 ops/s
L3CacheBenchmark.testMethod 31 thrpt 100 193981,979 ± 12478,424 ops/s
L3CacheBenchmark.testMethod 32 thrpt 100 53761,030 ± 259,888 ops/s
L3CacheBenchmark.testMethod 33 thrpt 100 213585,493 ± 23543,671 ops/s
L3CacheBenchmark.testMethod 34 thrpt 100 325214,062 ± 1758,479 ops/s
L3CacheBenchmark.testMethod 35 thrpt 100 306652,634 ± 2237,818 ops/s
L3CacheBenchmark.testMethod 36 thrpt 100 297992,930 ± 1019,248 ops/s
L3CacheBenchmark.testMethod 37 thrpt 100 181671,812 ± 21984,441 ops/s
L3CacheBenchmark.testMethod 38 thrpt 100 321929,616 ± 1798,747 ops/s
L3CacheBenchmark.testMethod 39 thrpt 100 251587,385 ± 12292,670 ops/s
L3CacheBenchmark.testMethod 40 thrpt 100 49777,196 ± 227,620 ops/s
As you can see, throughput is different, and the most striking difference is for arrays with size multiple of 8: speed degradation is almost 4 times. Also, for example, speed for array with size 37 Mb is almost twice less than for 38 Mb. I did not find any logical explanation of my findings.
P.S. Cpu i7 4700mq 6 Mb cache: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-4700MQ%20Mobile%20processor.html
What causes this behaviour?
You are observing the effect of cache associativity.
Your CPU has 256 KB 8-way set associative L2 cache per core. It can store up to 256 KB / 64
cache lines where no more than 8 lines can have the same index bits.
Your benchmark loop writes to 1025 different addresses. However, depending on the stride, these addresses may fall into the small number of sets, resulting in collisions and evictions from the cache. That's exactly what happens in your case when stride (step) = 8191, 16383, 24575, etc.
To verify this theory, rerun JMH benchmark with -prof perfnorm
option.
Here are the stats for size = 8 and size = 9:
L3CacheBenchmark.testMethod:CPI 8 thrpt 1.173 #/op
L3CacheBenchmark.testMethod:L1-dcache-load-misses 8 thrpt 1048.088 #/op
L3CacheBenchmark.testMethod:L1-dcache-loads 8 thrpt 1073.767 #/op
L3CacheBenchmark.testMethod:L1-dcache-store-misses 8 thrpt 1049.491 #/op
L3CacheBenchmark.testMethod:L1-dcache-stores 8 thrpt 1060.069 #/op
L3CacheBenchmark.testMethod:L1-icache-load-misses 8 thrpt 1.209 #/op
L3CacheBenchmark.testMethod:LLC-load-misses 8 thrpt 0.082 #/op
L3CacheBenchmark.testMethod:LLC-loads 8 thrpt 1.399 #/op
L3CacheBenchmark.testMethod:LLC-store-misses 8 thrpt 0.077 #/op
L3CacheBenchmark.testMethod:LLC-stores 8 thrpt 1035.877 #/op
L3CacheBenchmark.testMethod:branch-misses 8 thrpt 1.234 #/op
L3CacheBenchmark.testMethod:branches 8 thrpt 2096.674 #/op
L3CacheBenchmark.testMethod:cycles 8 thrpt 13520.964 #/op
L3CacheBenchmark.testMethod:dTLB-load-misses 8 thrpt 0.057 #/op
L3CacheBenchmark.testMethod:dTLB-loads 8 thrpt 1086.355 #/op
L3CacheBenchmark.testMethod:dTLB-store-misses 8 thrpt 0.020 #/op
L3CacheBenchmark.testMethod:dTLB-stores 8 thrpt 1068.579 #/op
L3CacheBenchmark.testMethod:iTLB-load-misses 8 thrpt 0.044 #/op
L3CacheBenchmark.testMethod:iTLB-loads 8 thrpt 0.018 #/op
L3CacheBenchmark.testMethod:instructions 8 thrpt 11530.742 #/op
L3CacheBenchmark.testMethod:stalled-cycles-backend 8 thrpt 8315.437 #/op
L3CacheBenchmark.testMethod:stalled-cycles-frontend 8 thrpt 10359.447 #/op
L3CacheBenchmark.testMethod:CPI 9 thrpt 0.871 #/op
L3CacheBenchmark.testMethod:L1-dcache-load-misses 9 thrpt 1055.973 #/op
L3CacheBenchmark.testMethod:L1-dcache-loads 9 thrpt 1068.958 #/op
L3CacheBenchmark.testMethod:L1-dcache-store-misses 9 thrpt 1045.480 #/op
L3CacheBenchmark.testMethod:L1-dcache-stores 9 thrpt 1057.328 #/op
L3CacheBenchmark.testMethod:L1-icache-load-misses 9 thrpt 1.108 #/op
L3CacheBenchmark.testMethod:LLC-load-misses 9 thrpt 0.174 #/op
L3CacheBenchmark.testMethod:LLC-loads 9 thrpt 0.304 #/op
L3CacheBenchmark.testMethod:LLC-store-misses 9 thrpt 0.045 #/op
L3CacheBenchmark.testMethod:LLC-stores 9 thrpt 0.350 #/op
L3CacheBenchmark.testMethod:branch-misses 9 thrpt 1.072 #/op
L3CacheBenchmark.testMethod:branches 9 thrpt 2099.846 #/op
L3CacheBenchmark.testMethod:cycles 9 thrpt 10041.724 #/op
L3CacheBenchmark.testMethod:dTLB-load-misses 9 thrpt 0.086 #/op
L3CacheBenchmark.testMethod:dTLB-loads 9 thrpt 1073.633 #/op
L3CacheBenchmark.testMethod:dTLB-store-misses 9 thrpt 0.045 #/op
L3CacheBenchmark.testMethod:dTLB-stores 9 thrpt 1054.587 #/op
L3CacheBenchmark.testMethod:iTLB-load-misses 9 thrpt 0.044 #/op
L3CacheBenchmark.testMethod:iTLB-loads 9 thrpt 0.037 #/op
L3CacheBenchmark.testMethod:instructions 9 thrpt 11529.996 #/op
L3CacheBenchmark.testMethod:stalled-cycles-backend 9 thrpt 3439.278 #/op
L3CacheBenchmark.testMethod:stalled-cycles-frontend 9 thrpt 6888.714 #/op
The most notable is the difference in LLC-stores
: 1035 for size = 8 and almost none for size = 9. This means that the data being stored does not fit into L2 cache and goes to L3.
BTW, your benchmark cannot measure the effect of L3 cache, since it touches only small amount of data (about 64 KB). In order to make a fair test, you'll need to read and write the whole range of the allocated array.