Tags: c++, arrays, performance, caching

Efficiently calculate mean in a large C++ array


I have a data set of 65K samples in 7 blocks: float arr[7][65536]. I need to calculate the mean of each group of 7 corresponding numbers, so the result will be an array of size 65536:

    result[0] = (arr[0][0] + arr[1][0] + arr[2][0] + ...) / 7
    result[1] = (arr[0][1] + arr[1][1] + arr[2][1] + ...) / 7

The problem is that accessing the memory non-sequentially will create many cache misses. Is there a better approach to this problem in terms of memory access? Reshaping the array in advance suffers from the same memory inefficiency.
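
A naive version of this computation (just a rough sketch of what I mean) looks like this:

    // For each output element, read one value from each of the 7 rows.
    // Inside the inner loop, consecutive reads are 65536 floats
    // (256 KiB) apart, which is the non-sequential access I am
    // worried about.
    for(size_t k=0 ; k<65536 ; ++k)
    {
        float sum = 0.0f;
        for(size_t i=0 ; i<7 ; ++i)
            sum += arr[i][k];
        result[k] = sum / 7.0f;
    }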

Thanks.


Solution

  • One simple solution is to walk along the contiguous dimension, accumulating into result row by row (a sketch of that version is at the end of this answer). However, result itself holds 65536 floats (256 KiB), which does not fit in the L1 cache of typical CPUs, so each pass over it would be reloaded from a slower cache level. The optimal solution is probably to use blocking to solve this issue.

    Here is a C++ example code to do that:

    // Blocked reduction using blocks of 1024 floats (4 KiB), small
    // enough for the current block of result to stay in the L1 cache.
    // This outer loop iterates over the blocks.
    for(size_t j=0 ; j<65536 ; j+=1024)
    {
        // Initialise the current block of result with the first row
        for(size_t k=j ; k<j+1024 ; ++k)
            result[k] = arr[0][k];
    
        // Add the remaining 6 rows to the current block
        for(size_t i=1 ; i<7 ; ++i)
            for(size_t k=j ; k<j+1024 ; ++k)
                result[k] += arr[i][k];
    
        // Divide by the number of rows to get the mean
        for(size_t k=j ; k<j+1024 ; ++k)
            result[k] /= 7;
    }
    

    Note that the type of result must be large enough to hold the intermediate sums, which are up to 7 times bigger than the values of arr[0][0] (the division only happens at the end); float is fine here unless the inputs are close to the float limit. This loop should be vectorized by your compiler and should produce far fewer cache misses, resulting in much faster code.

    PS: if the loop is not vectorized, you can help your compiler by adding the OpenMP directive #pragma omp simd just before each inner k-based loop, as shown in the sketch below (and make sure OpenMP support is enabled in your compiler, e.g. with -fopenmp or -fopenmp-simd on GCC/Clang).
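
    For completeness, here is a sketch of the simple row-by-row version mentioned at the top of this answer (walking the contiguous dimension without blocking), with the #pragma omp simd hints in place. The function name mean_rows and the hard-coded sizes are only for illustration:

    #include <cstddef>
    
    // Row-by-row (contiguous-dimension) version without blocking:
    // each pass streams through one 256 KiB row of arr, but it also
    // re-reads and re-writes the whole 256 KiB result array on every
    // pass, which is exactly what the blocked version above avoids.
    void mean_rows(const float arr[7][65536], float result[65536])
    {
        // Initialise result with the first row
        #pragma omp simd
        for(size_t k=0 ; k<65536 ; ++k)
            result[k] = arr[0][k];
    
        // Accumulate the remaining 6 rows
        for(size_t i=1 ; i<7 ; ++i)
        {
            #pragma omp simd
            for(size_t k=0 ; k<65536 ; ++k)
                result[k] += arr[i][k];
        }
    
        // Divide by the number of rows to get the mean
        #pragma omp simd
        for(size_t k=0 ; k<65536 ; ++k)
            result[k] /= 7;
    }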