c++parallel-processing openmp matrix-multiplication

How to optimize my C++ OpenMp Matrix Multiplication code

I have written a C++ OpenMp Matrix Multiplication code that multiplies two 1000x1000 matrices. So far I have gotten a 0.700 sec execution time using OpenMp but I want to see if there is other ways I can make it faster using OpenMp?

I appreciate any advice or tips you can give me.

Here is my code:

#include <iostream>
#include <time.h>
#include <omp.h>

using namespace std;

void Multiply()
 {
    //initialize matrices with random numbers
    int aMatrix[1000][1000], i, j;
  for( i = 0; i < 1000; ++i)
  {for( j = 0;  j < 1000; ++j)
     {aMatrix[i][j] = rand();}
  }
    int bMatrix[1000][1000], i1, j2;
  for( i1 = 0; i1 < 1000; ++i1)
  {for( j2 = 0;  j2 < 1000; ++j2)
     {bMatrix[i1][j2] = rand();}
  }
  //Result Matrix
    int product[1000][1000]  = {0};


    for (int row = 0; row < 1000; row++) {
        for (int col = 0; col < 1000; col++) {
            // Multiply the row of A by the column of B to get the row, column of product.
            for (int inner = 0; inner < 1000; inner++) {
                product[row][col] += aMatrix[row][inner] * bMatrix[inner][col];
            }

        }
    
    }
}

int main() {

   time_t begin, end;
    time(&begin);

    Multiply();

  time(&end);
  time_t elapsed = end - begin;

  cout << ("Time measured: %ld seconds.\n", elapsed);

return 0;
}

Solution

Following things can be done for speedup:

Using OpenMP for parallelizing external loop, like you did (and like I also did in my following code). Or alternatively using std::async for multi-threading like it was used in another answer.
Transpose B matrix, this will help to increase L1 cache hits, because you will read from sequential memory each B column (or row in transposed variant).
Use vectorized SIMD instructions, this will allow to do several multiplications (and additions) within one CPU cycle. Often compilers do auto-vectorization of your loops well enough through SIMD instructions without your help, but I did it explicitly in my code.
Run several independent SIMD instructions within loop. This will help to occupy whole CPU pipeline of SIMD. I did so in my code by using four SIMD registers r0, r1, r2, r3. In compilers this is usually called loop unrolling.
Align your matrix starting address on 64-bytes boundary. This will help SIMD instructions to do fast aligned read/write.
Align starting address of each matrix row on 64-bytes boundary. I did this in my code by padding each row with zeros till multiple of 64-bytes. This also helps SIMD instructions to do fast aligned read/write.

In my following code I did all 1. - 6. steps above. Memory 64-bytes alignment I did through implementing AlignmentAllocator that was used in std::vector. Also I did time measurements for float/double/int.

On my old 4-core laptop I got following time measurements for the case of 1000x1000 matrix multiplying by 1000x1000:

 float: time 0.1569 sec
double: time 0.3168 sec
   int: time 0.1565 sec

To compare my hardware capabilities I did measurements of another answer of @doug for the case of int:

Threads w transpose   0.2164 secs.

As one can see my solution is 1.4x times faster that the other answer, I guess due to memory 64-bytes alignment and maybe due to using explicit SIMD (instead of relying on compiler auto-vectorization of a loop).

To compile my program, don't forget to add -fopenmp -lgomp options (for OpenMP support) and -march=native -O3 -std=c++20 options (for SIMD support, optimizations and standard) if you're compiling under GCC/CLang, while MSVC I guess adds OpenMP automatically and doesn't need any special options (use /O2 /GL /std:c++latest for optimizations and standard in MSVC).

In my code I only implemented SSE2/SSE4/AVX/AVX2 instructions for SIMD, if you have more powerful machine you may tell me and I implement also FMA/AVX-512, they will give even twice more speed boost.

My Mul() function is quite generic, it is templated, and you just pass pointers to matrices and row/col count, so your matrices may be stored on calling side in any way (through std::vector or std::array or plain 2D array).

At start of Run() function you may change number of rows and columns if you need a bigger test. Notice that all my functions support any rows and columns, you may even multiply matrix of size 1234x2345 by 2345x3456.

Try it online!

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <iostream>
#include <iomanip>
#include <vector>
#include <memory>
#include <string>

#include <immintrin.h>

#define USE_OPENMP 1
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#if defined(_MSC_VER)
    #define IS_MSVC 1
#else
    #define IS_MSVC 0
#endif

#if USE_OPENMP
    #include <omp.h>
#endif

template <typename T, std::size_t N>
class AlignmentAllocator {
  public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;

  public:
    inline AlignmentAllocator() throw() {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
    inline ~AlignmentAllocator() throw() {}
    inline pointer adress(reference r) { return &r; }
    inline const_pointer adress(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & wert);
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
    bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};

template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
    #if IS_MSVC
        auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
    #else
        auto p = (pointer)std::aligned_alloc(N, n * sizeof(value_type));
    #endif
    ASSERT(p);
    return p;
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
    #if IS_MSVC
        _aligned_free(p);
    #else
        std::free(p);
    #endif
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
    new (p) value_type(wert);
}

template <typename T>
using AlignedVector = std::vector<T, AlignmentAllocator<T, 64>>;

template <typename T>
struct RegT;

#ifdef __AVX__
    template <> struct RegT<float> { static size_t constexpr bisize = 256; using type = __m256; static type zero() { return _mm256_setzero_ps(); } };
    template <> struct RegT<double> { static size_t constexpr bisize = 256; using type = __m256d; static type zero() { return _mm256_setzero_pd(); } };
    
    inline void MulAddReg(float const * a, float const * b, __m256 & c) {
        c = _mm256_add_ps(c, _mm256_mul_ps(_mm256_load_ps(a), _mm256_load_ps(b)));
    }
    inline void MulAddReg(double const * a, double const * b, __m256d & c) {
        c = _mm256_add_pd(c, _mm256_mul_pd(_mm256_load_pd(a), _mm256_load_pd(b)));
    }
    
    inline void StoreReg(float * dst, __m256 const & src) { _mm256_store_ps(dst, src); }
    inline void StoreReg(double * dst, __m256d const & src) { _mm256_store_pd(dst, src); }
#else // SSE2
    template <> struct RegT<float> { static size_t constexpr bisize = 128; using type = __m128; static type zero() { return _mm_setzero_ps(); } };
    template <> struct RegT<double> { static size_t constexpr bisize = 128; using type = __m128d; static type zero() { return _mm_setzero_pd(); } };

    inline void MulAddReg(float const * a, float const * b, __m128 & c) {
        c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
    }
    inline void MulAddReg(double const * a, double const * b, __m128d & c) {
        c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(a), _mm_load_pd(b)));
    }
    
    inline void StoreReg(float * dst, __m128 const & src) { _mm_store_ps(dst, src); }
    inline void StoreReg(double * dst, __m128d const & src) { _mm_store_pd(dst, src); }
#endif

#ifdef __AVX2__
    template <> struct RegT<int32_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
    //template <> struct RegT<int64_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };

    inline void MulAddReg(int32_t const * a, int32_t const * b, __m256i & c) {
        c = _mm256_add_epi32(c, _mm256_mullo_epi32(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
    }
    //inline void MulAddReg(int64_t const * a, int64_t const * b, __m256i & c) {
    //    c = _mm256_add_epi64(c, _mm256_mullo_epi64(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
    //}
    
    inline void StoreReg(int32_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
    //inline void StoreReg(int64_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
#else // SSE2
    template <> struct RegT<int32_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
    //template <> struct RegT<int64_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };

    inline void MulAddReg(int32_t const * a, int32_t const * b, __m128i & c) {
        c = _mm_add_epi32(c, _mm_mullo_epi32(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
    }
    //inline void MulAddReg(int64_t const * a, int64_t const * b, __m128i & c) {
    //    c = _mm_add_epi64(c, _mm_mullo_epi64(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
    //}
    
    inline void StoreReg(int32_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
    //inline void StoreReg(int64_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
#endif    

template <typename T>
void Mul(T const * A0, size_t A_rows, size_t A_cols, T const * B0, size_t B_rows, size_t B_cols, T * C) {
    size_t constexpr reg_cnt = RegT<T>::bisize / 8 / sizeof(T), block = 4 * reg_cnt;
    ASSERT(A_cols == B_rows);
    size_t const A_cols_aligned = (A_cols + block - 1) / block * block, B_rows_aligned = (B_rows + block - 1) / block * block;
    
    // Copy aligned A
    AlignedVector<T> Av(A_rows * A_cols_aligned);
    for (size_t i = 0; i < A_rows; ++i)
        std::memcpy(&Av[i * A_cols_aligned], &A0[i * A_cols], sizeof(Av[0]) * A_cols);
    T const * A = Av.data();
    // Transpose B
    AlignedVector<T> Bv(B_cols * B_rows_aligned);
    for (size_t j = 0; j < B_cols; ++j)
        for (size_t i = 0; i < B_rows; ++i)
            Bv[j * B_rows_aligned + i] = B0[i * B_cols + j];
    T const * Bt = Bv.data();
    ASSERT(uintptr_t(A) % 64 == 0 && uintptr_t(Bt) % 64 == 0);
    ASSERT(uintptr_t(&A[A_cols_aligned]) % 64 == 0 && uintptr_t(&Bt[B_rows_aligned]) % 64 == 0);
    
    // Multiply
    #pragma omp parallel for
    for (size_t i = 0; i < A_rows; ++i) {
        // Aligned Reg storage
        AlignedVector<T> Regs(block);
        
        for (size_t j = 0; j < B_cols; ++j) {
            T const * Arow = &A[i * A_cols_aligned + 0], * Btrow = &Bt[j * B_rows_aligned + 0];
            
            using Reg = typename RegT<T>::type;
            Reg r0 = RegT<T>::zero(), r1 = RegT<T>::zero(), r2 = RegT<T>::zero(), r3 = RegT<T>::zero();
            
            size_t const k_hi = A_cols - A_cols % block;
            
            for (size_t k = 0; k < k_hi; k += block) {
                MulAddReg(&Arow[k + reg_cnt * 0], &Btrow[k + reg_cnt * 0], r0);
                MulAddReg(&Arow[k + reg_cnt * 1], &Btrow[k + reg_cnt * 1], r1);
                MulAddReg(&Arow[k + reg_cnt * 2], &Btrow[k + reg_cnt * 2], r2);
                MulAddReg(&Arow[k + reg_cnt * 3], &Btrow[k + reg_cnt * 3], r3);
            }
            
            StoreReg(&Regs[reg_cnt * 0], r0);
            StoreReg(&Regs[reg_cnt * 1], r1);
            StoreReg(&Regs[reg_cnt * 2], r2);
            StoreReg(&Regs[reg_cnt * 3], r3);
            
            T sum1 = 0, sum2 = 0, sum3 = 0;
            for (size_t k = 0; k < Regs.size(); ++k)
                sum1 += Regs[k];
            
            //for (size_t k = 0; k < A_cols - A_cols % block; ++k) sum3 += Arow[k] * Btrow[k];
            
            for (size_t k = k_hi; k < A_cols; ++k)
                sum2 += Arow[k] * Btrow[k];
            
            C[i * A_rows + j] = sum2 + sum1;
        }
    }
}

#include <random>
#include <thread>
#include <chrono>
#include <type_traits>

template <typename T>
void Test(T const * A, size_t A_rows, size_t A_cols, T const * B, size_t B_rows, size_t B_cols, T const * C, T eps) {
    for (size_t i = 0; i < A_rows / 16; ++i)
        for (size_t j = 0; j < B_cols / 16; ++j) {
            T sum = 0;
            for (size_t k = 0; k < A_cols; ++k)
                sum += A[i * A_cols + k] * B[k * B_cols + j];
            ASSERT_MSG(std::abs(C[i * A_rows + j] - sum) <= eps * A_cols, "i " + std::to_string(i) + " j " + std::to_string(j) +
                " C " + std::to_string(C[i * A_rows + j]) + " ref " + std::to_string(sum));
        }
}

double Time() {
    static auto const gtb = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

template <typename T>
void Run() {
    size_t constexpr A_rows = 1000, A_cols = 1000, B_rows = 1000, B_cols = 1000;
    
    std::string const tname = std::is_same_v<T, float> ? "float" : std::is_same_v<T, double> ?
        "double" : std::is_same_v<T, int32_t> ? "int" : "<unknown>";
    bool const is_int = tname == "int";
    std::mt19937_64 rng{123};
    std::vector<T> A(A_rows * A_cols), B(B_rows * B_cols), C(A_rows * B_cols);
    for (size_t i = 0; i < A.size(); ++i)
        A[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
    for (size_t i = 0; i < B.size(); ++i)
        B[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
    auto tim = Time();
    Mul(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0]);
    tim = Time() - tim;
    std::cout << std::setw(6) << tname << ": time " << std::fixed << std::setprecision(4) << tim << " sec" << std::endl;
    Test(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0], tname == "float" ? T(1e-7) : tname == "double" ? T(1e-15) : T(0));
}

int main() {
    try {
        #if USE_OPENMP
            omp_set_num_threads(std::thread::hardware_concurrency());
        #endif
        Run<float>();
        Run<double>();
        Run<int32_t>();
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

 float: time 0.1569 sec
double: time 0.3168 sec
   int: time 0.1565 sec