Tags: c++, openmp, mmap, apple-silicon

OpenMP column-major access to a large memory-mapped array is (sometimes) slow


I've got a large binary file of doubles, representing an array with 650000 rows and 10210 columns -- it takes up about 53 GB. We are using C mmap (via <sys/mman.h>) to access it.

Unfortunately, we need to operate on the array column-by-column. If I run this in a single thread, it works fine, albeit a bit slow (typically 12-20 minutes depending on machine details).

To speed this up we have tried to use OpenMP for threading. Sometimes it works fine, but sometimes accessing the file slows to a crawl. With exactly the same input and exactly the same OpenMP thread settings, the run time can vary by a factor of something like 500! This is specifically an issue on an M1 Mac Studio with 20 cores. (I know that 4 of them are "efficiency" cores, but the problem occurs even when using far fewer threads than the OpenMP default of all 20.)

I can imagine that there could be contention issues when different threads are trying to access very different parts of the memory-mapped file, but I don't know whether that is actually what's going on. Any specific ideas?

For information, we are opening the file and creating the mapping with

double* m_address = (double*)mmap(NULL, inputLength, PROT_READ, (MAP_SHARED | MAP_NORESERVE), m_filehandle, 0);     

and, later, accessing it with

int numColumns = 10210;
int numRows = 650000;
std::vector<double> thisColumn(numRows, 0.0);

#pragma omp parallel for firstprivate(thisColumn)
for (unsigned int col=0; col<numColumns; ++col) {

    // the following is the slow bit...
    // read column col into the thread-local copy of thisColumn
    for (unsigned int row=0; row<numRows; ++row) {
        thisColumn[row] = m_address[size_t(row)*numColumns + col];
    }

    // process column col as thisColumn
}

Note that the inner loop walks the array with a stride of numColumns = 10210 elements, so for each value of col it jumps through the file from close to the beginning to close to the end.
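
To put rough numbers on that stride, here is a back-of-envelope sketch (the 16 KiB page size on Apple Silicon macOS is an assumption here, and readahead will change the exact figures):

#include <cstdio>
#include <cstddef>

int main() {
    const std::size_t numRows = 650000, numColumns = 10210;
    const std::size_t pageSize = 16384;   // 16 KiB pages on Apple Silicon macOS (assumed)

    // distance in the file between consecutive elements of one column
    const std::size_t strideBytes = numColumns * sizeof(double);
    std::printf("stride: %zu bytes (%.1f pages)\n", strideBytes, double(strideBytes) / double(pageSize));

    // the stride exceeds the page size, so every element of a column lives on its own
    // page: a cold read of a single column can fault in up to numRows distinct pages
    std::printf("pages per column: ~%zu, up to %.1f GB paged in when cold\n",
                numRows, double(numRows) * double(pageSize) / 1e9);
    return 0;
}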

The bad behaviour seems to happen when compiled with either clang 14 or gcc/g++-13.

To be clear, the problem isn't a generic "cache miss" problem. Sometimes it works fine and shows reasonable speedups with a moderate number of threads (from 1 to 8 or so). The puzzle is the brittleness: when running with multiple threads, the very same setup can vary in execution time by two orders of magnitude!

Any ideas for what is really going on, and whether there are any good fixes? (And in particular why the behaviour varies so much with the same inputs?)


Solution

  • As stated in the comments, the strided access to the file is highly inefficient, and this may be even worse with multithreaded access.

    Assuming you don't want to transpose the whole file at once and create a new file, a compromise would be to decouple reading and processing: read, say, 100 columns at a time, process these columns, read the next 100 columns, and so on. This is effectively transposing on the fly, block by block. There is still some strided access to the file, but part of the accesses are now sequential.

    Only the processing part is multithreaded, as multithreading the reads would result in false sharing on the destination array.

    The chunk size is a trade-off: the larger the better for performance, but the data have to fit in memory (with chunk = 100, the buffer holds 100 × 650000 doubles, about 520 MB).

    (not tested code)

    int numColumns = 10210;
    int numRows = 650000;
    int chunk = 100;

    std::vector<std::vector<double>> theseColumns(chunk, std::vector<double>(numRows));

    for (int col1 = 0; col1 < numColumns; col1 += chunk) {

        // reading chunk columns in serial mode
        for (int row = 0; row < numRows; ++row) {
            for (int col2 = 0; col2 < chunk && col1 + col2 < numColumns; col2++) {
                theseColumns[col2][row] = m_address[size_t(row)*numColumns + col1 + col2];
            }
        }

        // processing the chunk columns in parallel
        #pragma omp parallel for
        for (int col2 = 0; col2 < chunk && col1 + col2 < numColumns; col2++) {
            // process theseColumns[col2]
        }

    }
    

    Edit: given the large number of rows, carefully multithreading the reads could actually be possible with minimal false sharing. Note that this only makes sense on an NVMe SSD, as SATA SSDs don't really handle parallel accesses, and it would be outright catastrophic on an HDD.

    Parallel version for the reads:

        #pragma omp parallel
        {
            int rowsPerThread = (numRows - 1) / omp_get_num_threads() + 1;
            #pragma omp for schedule(static, rowsPerThread)
            for (int row = 0; row < numRows; ++row) {
                for (int col2 = 0; col2 < chunk && col1 + col2 < numColumns; col2++) {
                    theseColumns[col2][row] = m_address[size_t(row)*numColumns + col1 + col2];
                }
            }
        }
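
    For completeness, here is one way the pieces above could fit together: the parallel read replaces the serial read inside the chunked loop, followed by the parallel processing pass. processColumn is a hypothetical stand-in for whatever "process theseColumns[col2]" does, and like the rest this is an untested sketch.

        #include <omp.h>
        #include <cstddef>
        #include <vector>

        // hypothetical stand-in for the real per-column work
        void processColumn(int /*col*/, const std::vector<double>& /*column*/) { /* ... */ }

        void processAllColumns(const double* m_address, int numRows, int numColumns)
        {
            const int chunk = 100;
            std::vector<std::vector<double>> theseColumns(chunk, std::vector<double>(numRows));

            for (int col1 = 0; col1 < numColumns; col1 += chunk) {

                // parallel read of one block of columns: each thread gets one contiguous
                // band of rows, so the threads write to well-separated parts of theseColumns
                #pragma omp parallel
                {
                    int rowsPerThread = (numRows - 1) / omp_get_num_threads() + 1;
                    #pragma omp for schedule(static, rowsPerThread)
                    for (int row = 0; row < numRows; ++row) {
                        for (int col2 = 0; col2 < chunk && col1 + col2 < numColumns; col2++) {
                            theseColumns[col2][row] = m_address[std::size_t(row) * numColumns + col1 + col2];
                        }
                    }
                }

                // parallel processing of the block just read
                #pragma omp parallel for
                for (int col2 = 0; col2 < chunk && col1 + col2 < numColumns; col2++) {
                    processColumn(col1 + col2, theseColumns[col2]);
                }
            }
        }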