Tags: c++, openmp, c++20, matrix-multiplication

C++ OpenMP doesn't speed up matrix multiplication


I am a beginner in C++ and have written a simple matrix multiplication program, and it works.

What baffled me was that, when I benchmarked it, it was much slower than NumPy.

So I tried to speed it up with OpenMP, but I observed absolutely no change in performance:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <omp.h>
#include <string>
#include <vector>


using std::vector;
using std::chrono::high_resolution_clock;
using std::chrono::duration;
using std::chrono::duration_cast;
using std::chrono::microseconds;
using std::cout;
using line = vector<double>;
using matrix = vector<line>;


void fill(line &l) {
    std::generate(l.begin(), l.end(), []() { return ((double)rand() / (RAND_MAX)); });
}

matrix random_matrx(int64_t height, int64_t width) {
    matrix mat(height, line(width));
    std::for_each(mat.begin(), mat.end(), fill);
    return mat;
}

matrix dot_product(const matrix &mat0, const matrix &mat1) {
    size_t h0, w0, h1, w1;
    h0 = mat0.size();
    w0 = mat0[0].size();
    h1 = mat1.size();
    w1 = mat1[0].size();
    if (w0 != h1) {
        throw std::invalid_argument("matrices cannot be cross multiplied");
    }

    matrix out(h0, line(w1));
    for (int y = 0; y < h0; y++) {
        for (int x = 0; x < w1; x++) {
            double s = 0;
            for (int z = 0; z < w0; z++) {
                s += mat0[y][z] * mat1[z][x];
            }
            out[y][x] = s;
        }
    }

    return out;
}

matrix dot_product_omp(const matrix& mat0, const matrix& mat1) {
    size_t h0, w0, h1, w1;
    h0 = mat0.size();
    w0 = mat0[0].size();
    h1 = mat1.size();
    w1 = mat1[0].size();
    if (w0 != h1) {
        throw std::invalid_argument("matrices cannot be cross multiplied");
    }

    matrix out(h0, line(w1));
    omp_set_num_threads(4);
    #pragma omp parallel for private(y, x, z) schedule(dynamic)
    for (int y = 0; y < h0; y++) {
        for (int x = 0; x < w1; x++) {
            double s = 0;
            for (int z = 0; z < w0; z++) {
                s += mat0[y][z] * mat1[z][x];
            }
            out[y][x] = s;
        }
    }

    return out;
}


int main()
{
    matrix a, b;
    a = random_matrx(16, 9);
    b = random_matrx(9, 24);
    auto start = high_resolution_clock::now();
    for (int64_t i = 0; i < 65536; i++) {
        dot_product(a, b);
    }
    auto end = high_resolution_clock::now();
    duration<double, std::nano> time = end - start;
    double once = time.count() / 65536000;
    cout << "mat(16, 9) * mat(9, 24): " + std::to_string(once) + " microseconds\n";
    a = random_matrx(128, 256);
    b = random_matrx(256, 512);
    start = high_resolution_clock::now();
    for (int64_t i = 0; i < 512; i++) {
        dot_product(a, b);
    }
    end = high_resolution_clock::now();
    time = end - start;
    once = time.count() / 512000;
    cout << "mat(128, 256) * mat(256, 512): " + std::to_string(once) + " microseconds\n";
    start = high_resolution_clock::now();
    for (int64_t i = 0; i < 512; i++) {
        dot_product_omp(a, b);
    }
    end = high_resolution_clock::now();
    time = end - start;
    once = time.count() / 512000;
    cout << "mat(128, 256) * mat(256, 512) omp: " + std::to_string(once) + " microseconds\n";
}
PS D:\MyScript> C:\Users\Xeni\source\repos\matmul\x64\Release\matmul.exe
mat(16, 9) * mat(9, 24): 5.200116 microseconds
mat(128, 256) * mat(256, 512): 30128.739453 microseconds
mat(128, 256) * mat(256, 512) omp: 30116.103125 microseconds

I compiled it using Visual Studio 2022, C++20 standard, compiler flags:

/permissive- /ifcOutput "x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /sdl /Fd"x64\Release\vc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /std:c17 /Gd /Oi /MD /std:c++20 /FC /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Ot /Fp"x64\Release\matmul.pch" /diagnostics:column

Additional flags:

/arch:AVX2 /fp:fast 

Why is there no improvement? And how can I actually improve it?


I have since changed my OpenMP version to this:

matrix dot_product_omp(const matrix& mat0, const matrix& mat1) {
    size_t h0, w0, h1, w1;
    h0 = mat0.size();
    w0 = mat0[0].size();
    h1 = mat1.size();
    w1 = mat1[0].size();
    if (w0 != h1) {
        throw std::invalid_argument("matrices cannot be cross multiplied");
    }

    matrix out(h0, line(w1));
    omp_set_num_threads(4);
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < h0; y++) {
        for (int x = 0; x < w1; x++) {
            double s = 0;
            for (int z = 0; z < w0; z++) {
                s += mat0[y][z] * mat1[z][x];
            }
            out[y][x] = s;
        }
    }

    return out;
}

I compiled with the /openmp flag and benchmarked multiple times, but it only made the code run in about a quarter of the original time:

PS D:\MyScript> C:\Users\Xeni\source\repos\matmul\x64\Release\matmul.exe
mat(16, 9) * mat(9, 24): 5.126476 microseconds
mat(128, 256) * mat(256, 512): 30999.137109 microseconds
mat(128, 256) * mat(256, 512) omp: 8574.475195 microseconds

NumPy is much faster:

In [374]: a = np.random.random((128, 256))

In [375]: b = np.random.random((256, 512))

In [376]: %timeit a @ b
382 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

My code is an order of magnitude slower. How can I narrow the performance gap?


Solution

  • There are many issues in the code, but the key point is that the memory access pattern is inefficient, which prevents (nearly) any vectorization.

    The access to mat1[z][x] is inefficient: every time z increases, a new row vector has to be fetched, and then the x-th item of that row is fetched, causing two dependent, random-like memory accesses. Such accesses are far slower than sequential ones. Moreover, most compilers do not vectorize loops with such memory accesses because doing so is nearly impossible (it is theoretically possible with SIMD gathers, but very inefficient in practice). On top of that, cache lines are poorly used: only 8 of the 64 bytes of each cache line fetched from mat1 are used before the line is evicted from the cache (causing cache thrashing). This issue also prevents the application from scaling well on most platforms, because it becomes memory-bound (using more cores does not make the RAM any faster). You need to read data contiguously to get better performance. Here is a faster implementation:

        #pragma omp parallel for schedule(dynamic)
        for (int y = 0; y < h0; y++) {
            for (int x = 0; x < w1; x++) {
                out[y][x] = 0.0;
            }
            for (int z = 0; z < w0; z++) {
                for (int x = 0; x < w1; x++) {
                    out[y][x] += mat0[y][z] * mat1[z][x];
                }
            }
        }
    
    Before:
    mat(16, 9) * mat(9, 24): 2.931490 microseconds
    mat(128, 256) * mat(256, 512): 14704.138781 microseconds
    mat(128, 256) * mat(256, 512) omp: 4013.665295 microseconds
    
    After:
    mat(16, 9) * mat(9, 24): 0.931926 microseconds
    mat(128, 256) * mat(256, 512): 3296.070098 microseconds
    mat(128, 256) * mat(256, 512) omp: 1230.341350 microseconds
    

    The sequential code is 4.3 times faster than before, and the parallel one is now 3.3 times faster. The code should now be vectorized.

    The code is still pretty inefficient due to other factors like:

    • cache misses/thrashing: the matrix is reloaded from memory many times;
    • small FMA/load ratio: the CPU spends its time loading data when it could be spending it executing FMA instructions;
    • memory indirection: vector<vector<double>> is a very inefficient data structure for storing matrices; use a flattened array instead (or Eigen), as sketched after this list;
    • schedule(dynamic) is inefficient on most machines and should not actually be needed: if the code is optimized, the work is already balanced between the cores. Consider using schedule(static) instead;
    • NUMA effects (especially on servers and AMD processors);
    • etc.
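
    To illustrate the flattened-array and schedule(static) points above, here is a minimal sketch (the flat_matrix type and its helpers are illustrative names, not from the original code; the dimension check is omitted for brevity):

        #include <vector>

        // Row-major flat storage: one contiguous allocation, no per-row
        // pointer chasing; element (y, x) lives at index y * w + x.
        struct flat_matrix {
            int h, w;
            std::vector<double> data;
            flat_matrix(int h, int w) : h(h), w(w), data((size_t)h * w) {}
            double* row(int y) { return data.data() + (size_t)y * w; }
            const double* row(int y) const { return data.data() + (size_t)y * w; }
        };

        flat_matrix dot_product_flat(const flat_matrix& a, const flat_matrix& b) {
            flat_matrix out(a.h, b.w);  // zero-initialized by the vector
            // static schedule: the rows are split between threads once, up front
            #pragma omp parallel for schedule(static)
            for (int y = 0; y < a.h; y++) {
                for (int z = 0; z < a.w; z++) {
                    const double s = a.row(y)[z];
                    const double* brow = b.row(z);
                    double* orow = out.row(y);
                    for (int x = 0; x < b.w; x++) {
                        orow[x] += s * brow[x];  // contiguous, vectorizable
                    }
                }
            }
            return out;
        }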

    As mentioned in the comments, one should not expect to reach the speed of NumPy, since it calls the dgemm BLAS primitive, which is close to optimal on most machines (at least with OpenBLAS, BLIS and the Intel MKL). It is very hard to get similar performance without low-level SIMD intrinsics or assembly code (most compilers generate sub-optimal code that prevents optimal performance, mainly due to bad register allocation).
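
    For reference, getting that level of performance from C++ mostly means calling dgemm yourself. A minimal sketch, assuming a CBLAS-providing library such as OpenBLAS is installed and linked (the build setup is an assumption, not part of the original post):

        #include <cblas.h>
        #include <vector>

        // C = A * B, with A (m x k), B (k x n) and C (m x n), all row-major.
        void matmul_blas(const std::vector<double>& a, const std::vector<double>& b,
                         std::vector<double>& c, int m, int k, int n) {
            // alpha = 1 and beta = 0, i.e. C = 1 * A * B + 0 * C
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k,
                        1.0, a.data(), k,   // lda = k (row stride of A)
                        b.data(), n,        // ldb = n (row stride of B)
                        0.0, c.data(), n);  // ldc = n (row stride of C)
        }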

    A lot of HPC books and tutorials explain such problems and how to fix them. Some specifically explain how to get a relatively fast matrix multiplication; one classic technique they cover is cache blocking (tiling), sketched below. I strongly encourage you to read them.
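
    The idea of tiling is to compute the product block by block so the data touched in the inner loops stays in cache instead of being reloaded for every output row. A minimal sketch on flat row-major arrays; the tile size is illustrative and should be tuned for the target CPU:

        #include <algorithm>

        constexpr int TILE = 64;  // illustrative: a 64x64 block of doubles is 32 KiB

        // c must be zero-initialized; a is m x k, b is k x n, c is m x n (row-major).
        void matmul_tiled(const double* a, const double* b, double* c,
                          int m, int k, int n) {
            #pragma omp parallel for schedule(static)
            for (int y = 0; y < m; y++) {
                for (int z0 = 0; z0 < k; z0 += TILE) {
                    const int zmax = std::min(z0 + TILE, k);
                    for (int x0 = 0; x0 < n; x0 += TILE) {
                        const int xmax = std::min(x0 + TILE, n);
                        // the TILE output values c[y][x0..xmax) stay in cache
                        // (or registers) while the z loop accumulates into them;
                        // tiling the y loop too (omitted for brevity) would also
                        // let the block of b be reused across rows
                        for (int z = z0; z < zmax; z++) {
                            const double s = a[y * k + z];
                            for (int x = x0; x < xmax; x++) {
                                c[y * n + x] += s * b[z * n + x];
                            }
                        }
                    }
                }
            }
        }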

    Note that private(y, x, z) is useless here, since the variables are declared inside the loop. In fact, it is not even correct (compilers like GCC print an error, because the clause refers to variables that do not exist at the pragma). Also note that using omp_set_num_threads(4); is generally considered bad practice; you should instead set the OMP_NUM_THREADS environment variable (e.g. $env:OMP_NUM_THREADS = 4 in PowerShell). Besides, note that the default OpenMP version supported by MSVC is very old and completely obsolete; the Clang OpenMP implementation should be used instead. In fact, AFAIK, MSVC is not a good optimizing compiler, so Clang/GCC should be used instead (or any HPC compiler like ICC, PGI, etc.).