I'm trying matrix multiplication using MPI and would like some help understanding one issue. The machine has 6 cores, 32 KB L1 cache, 256 KB L2 cache and 15 MB L3 cache. The multiplication goes like this:
vector<vector<double>> mult_mpi(vector<vector<double>> m,
                                vector<vector<double>> n) {
    int rows = m.size();
    int size = n.size();
    vector<vector<double>> r(rows, vector<double>(size));
    for (int i = 0; i < rows; ++i)
        for (int k = 0; k < size; ++k)
            for (int j = 0; j < size; ++j)
                r[i][j] += m[i][k] * n[k][j];
    return r;
}
I have the same function for int:

vector<vector<int>> mult_mpi(vector<vector<int>> m, vector<vector<int>> n);
Then I made some plots; different line colors indicate the number of nodes.
The following plot shows time spent to multiply two int matrices:
And the following plot shows time spent to multiply two double matrices:
Why do I get the same times for 4 and 6 nodes in the double case? Am I running into the memory bandwidth limit?
I tried multiple times in the last hour with the same result. I also checked the machine load with top, but as far as I can tell I'm alone there.
Are you sure that you aren't timing the allocation of 4K vector<>'s...?
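For example, something along these lines would time only the multiply itself (just a sketch; mult_mpi is your function from the question, and N = 1000 is an arbitrary example size):

#include <mpi.h>
#include <iostream>
#include <vector>
using std::vector;

// mult_mpi is the function from the question (declared here, defined elsewhere).
vector<vector<double>> mult_mpi(vector<vector<double>> m, vector<vector<double>> n);

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1000;                                   // example size, an assumption
    vector<vector<double>> m(N, vector<double>(N, 1.0));  // allocation: NOT timed
    vector<vector<double>> n(N, vector<double>(N, 2.0));  // allocation: NOT timed

    double t0 = MPI_Wtime();
    auto r = mult_mpi(m, n);  // note: pass-by-value still copies m and n inside the timed region
    double t1 = MPI_Wtime();

    if (rank == 0)
        std::cout << "multiply: " << (t1 - t0) << " s, r[0][0] = " << r[0][0] << "\n";

    MPI_Finalize();
    return 0;
}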
vector<vector<>> is not a suitable type to squeeze out optimal performance. Matrix multiplication is one of the best operations with respect to scalability and "computation density" relative to memory accesses: the number of operations scales as O(N^3), while the amount of data scales only as O(N^2).
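Just to illustrate the point about the container (a minimal sketch, not an optimized implementation; the Matrix type and the mult name are my own): a single contiguous row-major buffer keeps all N*N elements adjacent in memory, whereas vector<vector<double>> scatters every row into a separate heap allocation.

#include <cstddef>
#include <vector>

// Contiguous row-major square matrix: element (i, j) lives at a[i * n + j].
struct Matrix {
    std::size_t n;
    std::vector<double> a;
    explicit Matrix(std::size_t n_) : n(n_), a(n_ * n_, 0.0) {}
    double&       operator()(std::size_t i, std::size_t j)       { return a[i * n + j]; }
    const double& operator()(std::size_t i, std::size_t j) const { return a[i * n + j]; }
};

// Same i-k-j loop order as in the question, but on contiguous storage.
Matrix mult(const Matrix& m, const Matrix& n) {
    Matrix r(m.n);
    for (std::size_t i = 0; i < m.n; ++i)
        for (std::size_t k = 0; k < m.n; ++k)
            for (std::size_t j = 0; j < m.n; ++j)
                r(i, j) += m(i, k) * n(k, j);
    return r;
}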
In fact matrix multiplication is used to benchmark the TOP500 fastest systems on Earth: HPL stands for "High Performance Linpack", Linpack being a reference implementation for linear algebra. Guess what... the operation used in the benchmark is DGEMM, that is, "Double precision GEneral Matrix-Matrix multiply".
DGEMM is the name of the operation in the BLAS library, the de-facto standard for linear algebra. Today there are many locally optimized BLAS libraries, both commercial (Intel MKL, IBM ESSL, ...) and open source (ATLAS), but all of them expose the same original BLAS interface (originally Fortran, now C too). (NOTE: the original reference implementation is not optimized.)
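For reference, this is roughly what a call looks like through the C (CBLAS) interface, computing C = alpha*A*B + beta*C; the header name and link flags depend on the vendor (cblas.h and -lopenblas or -latlas here are assumptions):

#include <cblas.h>
#include <vector>

// C = 1.0 * A * B + 0.0 * C, with A, B, C stored row-major as N x N contiguous arrays.
void dgemm_example(int N, const std::vector<double>& A,
                   const std::vector<double>& B, std::vector<double>& C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,            // M, N, K
                1.0, A.data(), N,   // alpha, A, lda
                B.data(), N,        // B, ldb
                0.0, C.data(), N);  // beta, C, ldc
}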
On top of BLAS there are also the LAPACK libraries: linear system solvers, eigensolvers, ... There are optimized LAPACK libraries as well, but usually 90% of the performance is already squeezed out by using an optimized BLAS library.
One powerful MPI-based parallel library that I know very well (not the only one... HPL is another) is ScaLAPACK. It contains PBLAS (Parallel BLAS), and within it an optimized, parallel version of DGEMM, among other things.
ScaLAPACK comes with the SLUG (ScaLAPACK Users' Guide), where you can find an excellent explanation of the block-cyclic distribution, the data-distribution strategy used to squeeze optimal performance out of linear algebra problems arranged on parallel systems.
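To get a feeling for it, here is the arithmetic of a 1D block-cyclic mapping (my own toy sketch, 0-based indices, first block on process 0; the 2D distribution applies the same idea independently to rows and columns):

#include <cstdio>

// Global index g, block size NB, P processes:
//   owning process:  (g / NB) % P
//   local index:     (g / (NB * P)) * NB + g % NB
int owner(int g, int NB, int P)       { return (g / NB) % P; }
int local_index(int g, int NB, int P) { return (g / (NB * P)) * NB + g % NB; }

int main() {
    const int NB = 2, P = 3;  // toy block size and process count
    for (int g = 0; g < 12; ++g)
        std::printf("global %2d -> process %d, local %d\n",
                    g, owner(g, NB, P), local_index(g, NB, P));
    return 0;
}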
To obtain optimal performance, however, you will need to link your MPI executable against a locally optimized BLAS library. Or write your own, but you are not alone, so don't reinvent the wheel.
Local optimization is obtained by accessing the matrices not by row, nor by column, but by block, with the block size tuned to optimize the usage of the caches and/or of the TLB (I recall just now libgoto (GotoBLAS), another BLAS library that optimized for minimal TLB misses and, some time ago, reached and surpassed Intel MKL on some systems). You can find more information, for example, in this ATLAS paper.
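The idea, on a single node and in plain C++ (a sketch only; BS = 64 is a placeholder that must be tuned to the actual cache and TLB sizes):

#include <algorithm>
#include <vector>

// Blocked (tiled) C += A * B on contiguous row-major N x N matrices.
void matmul_blocked(int N, const std::vector<double>& A,
                    const std::vector<double>& B, std::vector<double>& C,
                    int BS = 64) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                // Work on one BS x BS tile of A, B and C at a time,
                // so the tiles stay resident in cache while they are reused.
                for (int i = ii; i < std::min(ii + BS, N); ++i)
                    for (int k = kk; k < std::min(kk + BS, N); ++k) {
                        const double a_ik = A[i * N + k];
                        for (int j = jj; j < std::min(jj + BS, N); ++j)
                            C[i * N + j] += a_ik * B[k * N + j];
                    }
}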
In any case, if you really want to... I would start by analyzing how the other wheels have been forged before trying to make my own ;)