c, caching, optimization, nested-loops, simd

Why do I see a performance drop when using row-major order?


I have a piece of code that runs over a large matrix and computes column-wise binned statistics, where the bin of each item is given in a vector binvec.

The code goes (something) like this:

for (size_t item = 0; item < items; item++) {
    uint8_t bin = binvec[item];
    for (size_t col = 0; col < columns; col++) {
        size_t idx = item * items_stride + col * cols_stride;
        uint8_t val = matrix[idx];   /* value selects the histogram slot */
        float x = matrix2[idx];      /* weight to accumulate */
        count[bin][val][col] += x;
    }
}

Assume that the number of columns is known at compile time. The values of matrix have no specific structure or order; assume pure random values. The data size is fairly large: a couple of million items and hundreds of columns.

Looking at the code, I'd assume the best performance would be achieved when:

  1. matrix is row major, for better cache locality.
  2. count would be accessed as count[bin][col][val], so the address of count[bin][col] could be computed once per inner iteration, allowing easier prefetching, etc.

However, I got the best performance when creating matrix as column-major and accessing count in the order shown in the code.

Using option (1) or (2) instead incurs a ~50% runtime penalty. This goes against my intuition about cache locality, compiler optimizations, vectorization, etc.

Any idea as to why? This really baffles me.


Solution

  • I am a bit confused: in your example, matrix already is row-major. Could you share both implementations, leaving the count access aside?

    Your inner loop goes over columns, so row-major should indeed be better: one cache line covers several columns at once.

    As for count: your val depends on whatever is stored in matrix, while the columns are visited sequentially. So if you access count this way:

    count[bin][val][col]
    

    You get data from the cache only when several sequential rows within a column share the same val. Accessing it this way, though:

    count[bin][col][val]
    

    You have basically zero chance of hitting the cache, because every increment of col makes you hop far away in memory. That is my best guess for this part.

    Is your matrix (the one providing val) as random as you think?