Tags: c++, caching, memory, cpu-registers, cpu-cache

How is an array of pair<double,double> 2 times faster than two arrays of double in C++?


#include <iostream>
#include <chrono>
#include <random>
#include <cstdlib> // rand()
using namespace std;

typedef pair<double,double> pd;
#define x first   // .x holds the prefix sum
#define y second  // .y holds the prefix sum of squares
#define cell(i,j,w) ((i)*(w) + (j)) // row-major index into a flattened 2D array

class MyTimer
{
private:
    std::chrono::time_point<std::chrono::steady_clock> starter;
    std::chrono::time_point<std::chrono::steady_clock> ender;

public:
    void startCounter() {
        starter = std::chrono::steady_clock::now();
    }

    long long getCounter() {
        ender = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(ender - starter).count();
    }
};
        
int main()
{
    const int n = 5000;
    int* value1 = new int[(n + 1) * (n + 1)];
    int* value2 = new int[(n + 1) * (n + 1)];
    double* a = new double[(n + 1) * (n + 1)];
    double* b = new double[(n + 1) * (n + 1)];
    pd* packed = new pd[(n + 1) * (n + 1)];
    MyTimer timer;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            value1[cell(i, j, n + 1)] = rand() % 5000;
            value2[cell(i, j, n + 1)] = rand() % 5000;
        }

    for (int i = 1; i <= n; i++) {
        a[cell(i, 0, n + 1)] = 0;
        a[cell(0, i, n + 1)] = 0;
        b[cell(i, 0, n + 1)] = 0;
        b[cell(0, i, n + 1)] = 0;
        packed[cell(i, 0, n + 1)] = pd(0, 0);
        packed[cell(0, i, n + 1)] = pd(0, 0);
    }

    for (int tt=1; tt<=5; tt++)
    {
        timer.startCounter();
        for (int i=1; i<=n; i++)
            for (int j = 1; j <= n; j++) {
                // packed[i][j] = packed[i-1][j] + packed[i][j-1] - packed[i-1][j-1] + value1[i][j]
                packed[cell(i, j, n + 1)].x = packed[cell(i - 1, j, n + 1)].x + packed[cell(i, j - 1, n + 1)].x - packed[cell(i - 1, j - 1, n + 1)].x + value1[cell(i, j, n + 1)];
                packed[cell(i, j, n + 1)].y = packed[cell(i - 1, j, n + 1)].y + packed[cell(i, j - 1, n + 1)].y - packed[cell(i - 1, j - 1, n + 1)].y + value1[cell(i, j, n + 1)] * value1[cell(i, j, n + 1)];
            }
        cout << "Time packed  = " << timer.getCounter() << "\n";

        timer.startCounter();
        for (int i=1; i<=n; i++)
            for (int j = 1; j <= n; j++) {
                // a[i][j] = a[i-1][j] + a[i][j-1] - a[i-1][j-1] + value2[i][j];
                // b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + value2[i][j] * value2[i][j];
                a[cell(i, j, n + 1)] = a[cell(i - 1, j, n + 1)] + a[cell(i, j - 1, n + 1)] - a[cell(i - 1, j - 1, n + 1)] + value2[cell(i, j, n + 1)];
                b[cell(i, j, n + 1)] = b[cell(i - 1, j, n + 1)] + b[cell(i, j - 1, n + 1)] - b[cell(i - 1, j - 1, n + 1)] + value2[cell(i, j, n + 1)] * value2[cell(i, j, n + 1)];
            }
        cout << "Time separate = " << timer.getCounter() << "\n\n";
    }

    delete[] value1;
    delete[] value2;
    delete[] a;
    delete[] b;
    delete[] packed;
}

So I'm computing a 2D prefix sum table (a Summed Area Table), and I noticed the property stated in the title.
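
For context on why the table is worth building: once the SAT exists, the sum over any axis-aligned rectangle costs four lookups. Below is a minimal sketch of the standard query, reusing the cell macro from the code above (the helper name rectSum is mine, not part of the benchmark):

// Sum of value[i][j] over the rectangle r1 <= i <= r2, c1 <= j <= c2
// (1-based; relies on row 0 and column 0 being zero-filled, as above).
// 'sat' is a finished prefix table, 'w' is the row width (n + 1).
double rectSum(const double* sat, int w, int r1, int c1, int r2, int c2) {
    return sat[cell(r2, c2, w)] - sat[cell(r1 - 1, c2, w)]
         - sat[cell(r2, c1 - 1, w)] + sat[cell(r1 - 1, c1 - 1, w)];
}

Storing both the sum (.x) and the sum of squares (.y) means the same four lookups can give, e.g., the mean and variance of any window in O(1).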

When compiling with the CUDA nvcc compiler (with -O2), either from the command line or in Visual Studio Release mode, the packed version is 2x faster on the first run (separate takes 200ms, packed takes 100ms), but only 25% faster on subsequent runs (because value2[] is already cached after the first loop). In my actual program, which has more calculation steps (computing the SAT is just step 1), it's always 2x faster, since by then value1[] and value2[] have definitely been evicted from cache.

I know the packed array is faster because modern Intel CPUs read 32-64 bytes (a cache line) into cache at once. By packing both arrays together, the CPU can read both values in 1 main memory (RAM) access instead of 2. But why is the speedup so high? Along with the memory accesses, the CPU still has to perform 6 additions, 2 subtractions, and 1 multiplication per iteration. A 2x speedup from halving memory accesses means 100% improvement efficiency (Amdahl's Law), the same as if those add/mult operations didn't exist. How is that possible?
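
To sanity-check the packing claim, here's a minimal sketch (my own helper, not part of the benchmark; the 64-byte line size is an assumption, though it matches the i7-7700):

#include <cstdint>

// True if two addresses fall on the same 64-byte cache line.
bool sameCacheLine(const void* p, const void* q) {
    return (reinterpret_cast<std::uintptr_t>(p) / 64) ==
           (reinterpret_cast<std::uintptr_t>(q) / 64);
}

// sizeof(pd) == 16, and new[] on x86-64 typically returns 16-byte-aligned
// storage, so sameCacheLine(&packed[k].x, &packed[k].y) holds for every k.
// By contrast, a[k] and b[k] sit in allocations ~200 MB apart and can never
// share a line, so the separate version touches two lines per element.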

I'm certain it has something to do with CPU pipelining, but I can't explain it more thoroughly. Can anyone explain this further in terms of instruction latency, memory access latency, and assembly? Thank you.

The code doesn't use the GPU at all, so any other good compiler should give the same 2x speedup as nvcc does. With g++ 9.3.0 (g++ file.cpp -O2 -std=c++11 -o file.exe), it's also a 2x speedup. The CPU is an Intel i7-7700.

I've also run this program on two online compiler services with the arguments -O2 -std=c++11, and it shows a 1.5-2x speedup there too. Use n = 3000; anything bigger won't run (it's a free VM service, after all). So it's not just my computer.


Solution

  • The answer lies in the access latency of the different levels of the memory hierarchy, from the L1 cache down to main memory (RAM).

    Data in the L1 cache takes ~5 cycles to access, while data from RAM takes 50-100 cycles. Meanwhile, add/sub/mult operations take 3-5 cycles each.

    The dominating limiter of performance is therefore main memory access. So by cutting the number of main memory requests in half, performance almost doubles.
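
    A rough back-of-envelope makes this concrete. All latencies below are the assumed ballpark figures from this answer (not measurements), and the loop's 9 arithmetic ops (6 adds, 2 subs, 1 mul) are modeled two ways:

#include <iostream>

int main()
{
    // Ballpark figures from the answer above (assumptions, not measurements):
    const double ram = 75;     // cycles per main-memory access (50-100 range)
    const double alu = 9 * 4;  // 6 adds + 2 subs + 1 mul at ~4 cycles each

    double separate = 2 * ram; // a's cache line and b's cache line each miss
    double packed   = 1 * ram; // one pd element delivers both .x and .y

    // If the out-of-order core fully hides the arithmetic under the misses:
    std::cout << "memory-only model: " << separate / packed << "x\n"; // 2x
    // Even if the arithmetic ran strictly after the loads:
    std::cout << "serialized model:  "
              << (separate + alu) / (packed + alu) << "x\n"; // ~1.7x
}

    Either way the prediction lands near the observed 1.5-2x: the arithmetic is cheap enough relative to a miss that halving the misses roughly halves the time.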