I have a data set of 7 blocks of 65536 samples each: float arr[7][65536]
I need to calculate the mean of each set of 7 corresponding numbers, so the result will be an array of size 65536:
result[0] = (arr[0][0] + arr[1][0] + arr[2][0] + ... + arr[6][0]) / 7
result[1] = (arr[0][1] + arr[1][1] + arr[2][1] + ... + arr[6][1]) / 7
The problem is that accessing the memory non-sequentially creates many cache misses. Is there a better way, in terms of memory, to approach this problem? Reshaping the array in advance suffers from the same memory inefficiency.
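To illustrate, this is the access pattern I am worried about (a minimal sketch, assuming result is declared as float result[65536]):

// Column-wise traversal: the 7 reads for each output index are
// 65536 floats (256 KiB) apart, so the walk through memory is not sequential
for(size_t k=0 ; k<65536 ; ++k)
{
    float sum = 0.0f;
    for(size_t i=0 ; i<7 ; ++i)
        sum += arr[i][k];
    result[k] = sum / 7;
}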
Thanks.
One simple solution is to walk through the contiguous dimension, row by row. However, because each row here is quite big (65536 floats), result will not fit in the L1 cache, so every pass over it is served from a slower cache level, and this simple approach may not be optimal.
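For comparison, a minimal sketch of that simple row-by-row traversal, assuming the same arr and result declarations as in the question:

// Row-by-row reduction: each row of arr is read sequentially,
// but every pass also streams through the whole 256 KiB result array,
// which is far bigger than a typical 32-64 KiB L1 data cache
for(size_t k=0 ; k<65536 ; ++k)
    result[k] = arr[0][k];
for(size_t i=1 ; i<7 ; ++i)
    for(size_t k=0 ; k<65536 ; ++k)
        result[k] += arr[i][k];
for(size_t k=0 ; k<65536 ; ++k)
    result[k] /= 7;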
The optimal solution is probably to use blocking: compute result in chunks small enough to stay resident in the L1 cache. Here is some example C++ code to do that:
// Blocked reduction using blocks of size 1024 (4 KiB, fits in L1)
// The outer loop iterates over the blocks
for(size_t j=0 ; j<65536 ; j+=1024)
{
    // Initialize the current block with the first row
    for(size_t k=j ; k<j+1024 ; ++k)
        result[k] = arr[0][k];
    // Accumulate the remaining rows into the current block
    for(size_t i=1 ; i<7 ; ++i)
        for(size_t k=j ; k<j+1024 ; ++k)
            result[k] += arr[i][k];
    // Divide the current block by the row count to get the means
    for(size_t k=j ; k<j+1024 ; ++k)
        result[k] /= 7;
}
Note that the type of result must be large enough to hold the intermediate sums, which can be up to 7 times bigger than the values of arr[0][0]. This loop should be vectorized by your compiler and should produce fewer cache misses, resulting in much faster code.
PS: if the loop is not vectorized, you can help your compiler by adding the OpenMP directive #pragma omp simd just before the inner k-based loops (and ensure OpenMP is enabled).
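For example, applied to the accumulation loops of the blocked version above (a sketch; with GCC or Clang, compile with -fopenmp or -fopenmp-simd):

for(size_t i=1 ; i<7 ; ++i)
{
    // Ask the compiler to vectorize the innermost loop
    #pragma omp simd
    for(size_t k=j ; k<j+1024 ; ++k)
        result[k] += arr[i][k];
}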