Tags: c++, blas, intel-mkl

Intel MKL for matrix multiplication when rows are not memory-contiguous


Our hardware is an Intel Xeon Phi, so we are encouraged to get the most out of it by replacing hand-written linear algebra ops (e.g. square matrix multiplications) with Intel MKL calls.

The question is what the correct MKL usage is for this case, since the fact that our matrices' rows are not contiguous in memory may rule out certain functions, e.g. cblas_dgemm.

Is this a use-case for sparse BLAS?

Example of a matrix with non-contiguous rows:

#include <iostream>
#include <cstdlib> // malloc/free

int main()
{

        // construct this matrix:
        //
        // ( 1 2 3 )
        // ( 4 5 6 )

        const int NCOLS = 3;

        // allocate two perhaps-not-contiguous blocks of memory
        double *row1 = (double*)malloc(NCOLS * sizeof(double));
        double *row2 = (double*)malloc(NCOLS * sizeof(double));

        // initialize them to the desired values, i.e.
        row1[0] = 1;
        row1[1] = 2;
        row1[2] = 3;
        row2[0] = 4;
        row2[1] = 5;
        row2[2] = 6;

        // allocate a block of two memory elements the size of a pointer
        double **matrix = (double**)malloc(2 * sizeof(double*));

        // set them to point to the two (perhaps-not-contiguous) previous blocks
        matrix[0] = &row1[0];
        matrix[1] = &row2[0];

        // print
        for (auto j=0; j<2; j++)
        {
                for (auto i=0; i<3; i++)
                {
                        std::cout << matrix[j][i] << ",";
                }
                std::cout << "\n";
        }
        // release the two row buffers and the pointer table
        free(row1);
        free(row2);
        free(matrix);
}

Solution

  • Dense BLAS operations can operate on matrices with a given fixed stride (the leading dimension), but in your case the stride is not constant. Sparse BLAS is meant for matrices containing a lot of zeros, which is apparently not your case (at least not in the provided example).

    Since your matrices are huge in practice (20k x 20k), the fastest solution is to copy the matrix rows into one big contiguous buffer (see the first sketch at the end of this answer). Indeed, dense BLAS operations would not be efficient on arrays of arrays even if they were supported (which is AFAIK not the case anyway). This is because arrays of arrays are generally not stored efficiently in cache (if you are lucky, the rows may happen to be allocated nearly contiguously) and require more complex indexing.

    For example, on an x86-64 PC like mine, with ~40 GB/s RAM and an i5-9600KF processor capable of ~300 GFlops on such matrices, an O(n**3) matrix multiplication would take about 2 * 20_000**3 / 300e9 = 53.3 seconds, while an optimal copy of one matrix would take about 2 * 8 * 20_000**2 / 40e9 = 0.16 seconds. Thus, the copy time is negligible compared to the actual matrix multiplication time. That being said, three copies are certainly needed (the two input matrices and the output one). Additionally, fast BLAS implementations use the Strassen algorithm, whose better asymptotic complexity should make them about 2~3 times faster in this case. Still, the time for all the copies (~0.5 s) is very small compared to the actual matrix multiplication time (>17.8 s). The second sketch below recomputes these estimates.

    On a KNL Xeon Phi, the MCDRAM reaches a throughput of ~400 GB/s and the main RAM ~90 GB/s, while the processor can reach 3 TFlops. Thus, if you store the matrices in the MCDRAM, the result should be 1.8~5.3 seconds for the matrix multiplication and 0.05 seconds for all the copies. If the matrices are stored in the slow DRAM, the copy time is 0.21 seconds, which is significantly bigger but still small compared to the computation time. If you do not have enough space to store the matrices in MCDRAM, you can split them into big tiles (e.g. 10k x 10k) and compute each tile separately, using copies and a BLAS DGEMM for each tile (see the tiled sketch below).

    If you want even higher performance, you can dedicate a few threads of the Xeon Phi processor to copying blocks so as to overlap the copy time with the computation time (see the last sketch below). However, this will certainly make the code much more complex for a small improvement.
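
    Here is a minimal sketch of the copy-then-DGEMM approach. It assumes MKL's standard CBLAS interface from mkl.h and row-major storage; the helper names (pack_rows, multiply) are just illustrative:

#include <mkl.h>    // cblas_dgemm
#include <cstddef>
#include <cstring>
#include <vector>

// Copy an array-of-row-pointers matrix (nrows x ncols) into one contiguous
// row-major buffer so that dense BLAS can use a constant leading dimension.
static std::vector<double> pack_rows(double **rows, int nrows, int ncols)
{
        std::vector<double> buf((std::size_t)nrows * ncols);
        for (int i = 0; i < nrows; i++)
                std::memcpy(buf.data() + (std::size_t)i * ncols,
                            rows[i], ncols * sizeof(double));
        return buf;
}

// C = A * B with A: m x k and B: k x n, both given as arrays of row pointers.
std::vector<double> multiply(double **A, double **B, int m, int k, int n)
{
        std::vector<double> a = pack_rows(A, m, k);
        std::vector<double> b = pack_rows(B, k, n);
        std::vector<double> c((std::size_t)m * n);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0, a.data(), k,   // lda = k (row-major A)
                         b.data(), n,   // ldb = n
                    0.0, c.data(), n);  // ldc = n
        return c;
}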
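
    And a tiny program recomputing the back-of-envelope numbers above, so you can plug in the figures of your own hardware (the constants are the rough ones quoted in this answer, not measured values):

#include <cstdio>

int main()
{
        const double n     = 20000;  // matrix dimension
        const double flops = 300e9;  // sustained FLOP/s (rough i5-9600KF figure)
        const double bw    = 40e9;   // sustained RAM bandwidth in bytes/s

        const double t_gemm = 2 * n * n * n / flops;  // classic O(n**3) multiply
        const double t_copy = 2 * 8 * n * n / bw;     // one copy: read+write, 8 B per double
        std::printf("multiply: %.1f s, one copy: %.2f s, three copies: %.2f s\n",
                    t_gemm, t_copy, 3 * t_copy);
}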
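
    For the tiled variant, here is a sketch of the loop structure for square contiguous matrices. It passes tile views straight to DGEMM using the full leading dimension; on a KNL you would additionally stage each tile into an MCDRAM buffer (e.g. with the memkind/hbwmalloc library), which is omitted here:

#include <mkl.h>
#include <algorithm>
#include <cstddef>

// Tiled C = A * B for n x n row-major matrices, processed in T x T tiles
// so that the working set of each DGEMM call stays small.
void dgemm_tiled(const double *A, const double *B, double *C, int n, int T)
{
        for (int i = 0; i < n; i += T) {
                const int mi = std::min(T, n - i);
                for (int j = 0; j < n; j += T) {
                        const int nj = std::min(T, n - j);
                        for (int k = 0; k < n; k += T) {
                                const int kk = std::min(T, n - k);
                                // C[i.., j..] += A[i.., k..] * B[k.., j..]
                                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                                            mi, nj, kk, 1.0,
                                            A + (std::size_t)i * n + k, n,
                                            B + (std::size_t)k * n + j, n,
                                            (k == 0) ? 0.0 : 1.0,   // accumulate over k-tiles
                                            C + (std::size_t)i * n + j, n);
                        }
                }
        }
}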
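
    Finally, an illustrative double-buffering sketch for the copy/compute overlap. This structure is my own, not something prescribed by MKL: a helper task packs the next k-tile of a row-pointer matrix B while DGEMM consumes the current one. In practice you would also cap MKL's own thread count (e.g. with mkl_set_num_threads) so the packing thread gets a core of its own:

#include <mkl.h>
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <future>
#include <utility>
#include <vector>

// Pack rows [k0, k0 + kk) of a row-pointer matrix B (n columns each) into dst.
static void pack_block(double **B, int k0, int kk, int n, double *dst)
{
        for (int r = 0; r < kk; r++)
                std::memcpy(dst + (std::size_t)r * n, B[k0 + r], n * sizeof(double));
}

// C (m x n) = A (m x k_total, contiguous row-major) * B (array of row pointers),
// accumulated over k-tiles of height T while the next tile is packed in parallel.
void dgemm_overlapped(const double *A, double **B, double *C,
                      int m, int n, int k_total, int T)
{
        std::vector<double> buf0((std::size_t)T * n), buf1((std::size_t)T * n);
        double *cur = buf0.data(), *next = buf1.data();

        pack_block(B, 0, std::min(T, k_total), n, cur); // first tile, packed up front

        for (int k = 0; k < k_total; k += T) {
                const int kk = std::min(T, k_total - k);

                std::future<void> prefetch;             // start packing the next tile
                if (k + T < k_total)
                        prefetch = std::async(std::launch::async, pack_block, B,
                                              k + T, std::min(T, k_total - (k + T)), n, next);

                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                            m, n, kk, 1.0,
                            A + k, k_total,             // k-th column block of A (lda = k_total)
                            cur, n,
                            (k == 0) ? 0.0 : 1.0, C, n);

                if (prefetch.valid())
                        prefetch.get();                 // wait for the packer, then swap buffers
                std::swap(cur, next);
        }
}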