Search code examples
c++parallel-processingopenmpmatrix-multiplication

How to optimize my C++ OpenMp Matrix Multiplication code


I have written a C++ OpenMp Matrix Multiplication code that multiplies two 1000x1000 matrices. So far I have gotten a 0.700 sec execution time using OpenMp but I want to see if there is other ways I can make it faster using OpenMp?

I appreciate any advice or tips you can give me.

Here is my code:

#include <iostream>
#include <time.h>
#include <omp.h>

using namespace std;

void Multiply()
 {
    //initialize matrices with random numbers
    int aMatrix[1000][1000], i, j;
  for( i = 0; i < 1000; ++i)
  {for( j = 0;  j < 1000; ++j)
     {aMatrix[i][j] = rand();}
  }
    int bMatrix[1000][1000], i1, j2;
  for( i1 = 0; i1 < 1000; ++i1)
  {for( j2 = 0;  j2 < 1000; ++j2)
     {bMatrix[i1][j2] = rand();}
  }
  //Result Matrix
    int product[1000][1000]  = {0};


    for (int row = 0; row < 1000; row++) {
        for (int col = 0; col < 1000; col++) {
            // Multiply the row of A by the column of B to get the row, column of product.
            for (int inner = 0; inner < 1000; inner++) {
                product[row][col] += aMatrix[row][inner] * bMatrix[inner][col];
            }

        }
    
    }
}

int main() {

   time_t begin, end;
    time(&begin);

    Multiply();

  time(&end);
  time_t elapsed = end - begin;

  cout << ("Time measured: %ld seconds.\n", elapsed);

return 0;
}

Solution

  • Following things can be done for speedup:

    1. Using OpenMP for parallelizing external loop, like you did (and like I also did in my following code). Or alternatively using std::async for multi-threading like it was used in another answer.

    2. Transpose B matrix, this will help to increase L1 cache hits, because you will read from sequential memory each B column (or row in transposed variant).

    3. Use vectorized SIMD instructions, this will allow to do several multiplications (and additions) within one CPU cycle. Often compilers do auto-vectorization of your loops well enough through SIMD instructions without your help, but I did it explicitly in my code.

    4. Run several independent SIMD instructions within loop. This will help to occupy whole CPU pipeline of SIMD. I did so in my code by using four SIMD registers r0, r1, r2, r3. In compilers this is usually called loop unrolling.

    5. Align your matrix starting address on 64-bytes boundary. This will help SIMD instructions to do fast aligned read/write.

    6. Align starting address of each matrix row on 64-bytes boundary. I did this in my code by padding each row with zeros till multiple of 64-bytes. This also helps SIMD instructions to do fast aligned read/write.

    In my following code I did all 1. - 6. steps above. Memory 64-bytes alignment I did through implementing AlignmentAllocator that was used in std::vector. Also I did time measurements for float/double/int.

    On my old 4-core laptop I got following time measurements for the case of 1000x1000 matrix multiplying by 1000x1000:

     float: time 0.1569 sec
    double: time 0.3168 sec
       int: time 0.1565 sec
    

    To compare my hardware capabilities I did measurements of another answer of @doug for the case of int:

    Threads w transpose   0.2164 secs.
    

    As one can see my solution is 1.4x times faster that the other answer, I guess due to memory 64-bytes alignment and maybe due to using explicit SIMD (instead of relying on compiler auto-vectorization of a loop).

    To compile my program, don't forget to add -fopenmp -lgomp options (for OpenMP support) and -march=native -O3 -std=c++20 options (for SIMD support, optimizations and standard) if you're compiling under GCC/CLang, while MSVC I guess adds OpenMP automatically and doesn't need any special options (use /O2 /GL /std:c++latest for optimizations and standard in MSVC).

    In my code I only implemented SSE2/SSE4/AVX/AVX2 instructions for SIMD, if you have more powerful machine you may tell me and I implement also FMA/AVX-512, they will give even twice more speed boost.

    My Mul() function is quite generic, it is templated, and you just pass pointers to matrices and row/col count, so your matrices may be stored on calling side in any way (through std::vector or std::array or plain 2D array).

    At start of Run() function you may change number of rows and columns if you need a bigger test. Notice that all my functions support any rows and columns, you may even multiply matrix of size 1234x2345 by 2345x3456.

    Try it online!

    #include <cstdint>
    #include <cstring>
    #include <stdexcept>
    #include <iostream>
    #include <iomanip>
    #include <vector>
    #include <memory>
    #include <string>
    
    #include <immintrin.h>
    
    #define USE_OPENMP 1
    #define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg '" + std::string(msg) + "'."); }
    #define ASSERT(cond) ASSERT_MSG(cond, "")
    #if defined(_MSC_VER)
        #define IS_MSVC 1
    #else
        #define IS_MSVC 0
    #endif
    
    #if USE_OPENMP
        #include <omp.h>
    #endif
    
    template <typename T, std::size_t N>
    class AlignmentAllocator {
      public:
        typedef T value_type;
        typedef std::size_t size_type;
        typedef std::ptrdiff_t difference_type;
        typedef T * pointer;
        typedef const T * const_pointer;
        typedef T & reference;
        typedef const T & const_reference;
    
      public:
        inline AlignmentAllocator() throw() {}
        template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
        inline ~AlignmentAllocator() throw() {}
        inline pointer adress(reference r) { return &r; }
        inline const_pointer adress(const_reference r) const { return &r; }
        inline pointer allocate(size_type n);
        inline void deallocate(pointer p, size_type);
        inline void construct(pointer p, const value_type & wert);
        inline void destroy(pointer p) { p->~value_type(); }
        inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
        template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
        bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
        bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
    };
    
    template <typename T, std::size_t N>
    inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
        #if IS_MSVC
            auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
        #else
            auto p = (pointer)std::aligned_alloc(N, n * sizeof(value_type));
        #endif
        ASSERT(p);
        return p;
    }
    template <typename T, std::size_t N>
    inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
        #if IS_MSVC
            _aligned_free(p);
        #else
            std::free(p);
        #endif
    }
    template <typename T, std::size_t N>
    inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
        new (p) value_type(wert);
    }
    
    template <typename T>
    using AlignedVector = std::vector<T, AlignmentAllocator<T, 64>>;
    
    template <typename T>
    struct RegT;
    
    #ifdef __AVX__
        template <> struct RegT<float> { static size_t constexpr bisize = 256; using type = __m256; static type zero() { return _mm256_setzero_ps(); } };
        template <> struct RegT<double> { static size_t constexpr bisize = 256; using type = __m256d; static type zero() { return _mm256_setzero_pd(); } };
        
        inline void MulAddReg(float const * a, float const * b, __m256 & c) {
            c = _mm256_add_ps(c, _mm256_mul_ps(_mm256_load_ps(a), _mm256_load_ps(b)));
        }
        inline void MulAddReg(double const * a, double const * b, __m256d & c) {
            c = _mm256_add_pd(c, _mm256_mul_pd(_mm256_load_pd(a), _mm256_load_pd(b)));
        }
        
        inline void StoreReg(float * dst, __m256 const & src) { _mm256_store_ps(dst, src); }
        inline void StoreReg(double * dst, __m256d const & src) { _mm256_store_pd(dst, src); }
    #else // SSE2
        template <> struct RegT<float> { static size_t constexpr bisize = 128; using type = __m128; static type zero() { return _mm_setzero_ps(); } };
        template <> struct RegT<double> { static size_t constexpr bisize = 128; using type = __m128d; static type zero() { return _mm_setzero_pd(); } };
    
        inline void MulAddReg(float const * a, float const * b, __m128 & c) {
            c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
        }
        inline void MulAddReg(double const * a, double const * b, __m128d & c) {
            c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(a), _mm_load_pd(b)));
        }
        
        inline void StoreReg(float * dst, __m128 const & src) { _mm_store_ps(dst, src); }
        inline void StoreReg(double * dst, __m128d const & src) { _mm_store_pd(dst, src); }
    #endif
    
    #ifdef __AVX2__
        template <> struct RegT<int32_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
        //template <> struct RegT<int64_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
    
        inline void MulAddReg(int32_t const * a, int32_t const * b, __m256i & c) {
            c = _mm256_add_epi32(c, _mm256_mullo_epi32(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
        }
        //inline void MulAddReg(int64_t const * a, int64_t const * b, __m256i & c) {
        //    c = _mm256_add_epi64(c, _mm256_mullo_epi64(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
        //}
        
        inline void StoreReg(int32_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
        //inline void StoreReg(int64_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
    #else // SSE2
        template <> struct RegT<int32_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
        //template <> struct RegT<int64_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
    
        inline void MulAddReg(int32_t const * a, int32_t const * b, __m128i & c) {
            c = _mm_add_epi32(c, _mm_mullo_epi32(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
        }
        //inline void MulAddReg(int64_t const * a, int64_t const * b, __m128i & c) {
        //    c = _mm_add_epi64(c, _mm_mullo_epi64(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
        //}
        
        inline void StoreReg(int32_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
        //inline void StoreReg(int64_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
    #endif    
    
    template <typename T>
    void Mul(T const * A0, size_t A_rows, size_t A_cols, T const * B0, size_t B_rows, size_t B_cols, T * C) {
        size_t constexpr reg_cnt = RegT<T>::bisize / 8 / sizeof(T), block = 4 * reg_cnt;
        ASSERT(A_cols == B_rows);
        size_t const A_cols_aligned = (A_cols + block - 1) / block * block, B_rows_aligned = (B_rows + block - 1) / block * block;
        
        // Copy aligned A
        AlignedVector<T> Av(A_rows * A_cols_aligned);
        for (size_t i = 0; i < A_rows; ++i)
            std::memcpy(&Av[i * A_cols_aligned], &A0[i * A_cols], sizeof(Av[0]) * A_cols);
        T const * A = Av.data();
        // Transpose B
        AlignedVector<T> Bv(B_cols * B_rows_aligned);
        for (size_t j = 0; j < B_cols; ++j)
            for (size_t i = 0; i < B_rows; ++i)
                Bv[j * B_rows_aligned + i] = B0[i * B_cols + j];
        T const * Bt = Bv.data();
        ASSERT(uintptr_t(A) % 64 == 0 && uintptr_t(Bt) % 64 == 0);
        ASSERT(uintptr_t(&A[A_cols_aligned]) % 64 == 0 && uintptr_t(&Bt[B_rows_aligned]) % 64 == 0);
        
        // Multiply
        #pragma omp parallel for
        for (size_t i = 0; i < A_rows; ++i) {
            // Aligned Reg storage
            AlignedVector<T> Regs(block);
            
            for (size_t j = 0; j < B_cols; ++j) {
                T const * Arow = &A[i * A_cols_aligned + 0], * Btrow = &Bt[j * B_rows_aligned + 0];
                
                using Reg = typename RegT<T>::type;
                Reg r0 = RegT<T>::zero(), r1 = RegT<T>::zero(), r2 = RegT<T>::zero(), r3 = RegT<T>::zero();
                
                size_t const k_hi = A_cols - A_cols % block;
                
                for (size_t k = 0; k < k_hi; k += block) {
                    MulAddReg(&Arow[k + reg_cnt * 0], &Btrow[k + reg_cnt * 0], r0);
                    MulAddReg(&Arow[k + reg_cnt * 1], &Btrow[k + reg_cnt * 1], r1);
                    MulAddReg(&Arow[k + reg_cnt * 2], &Btrow[k + reg_cnt * 2], r2);
                    MulAddReg(&Arow[k + reg_cnt * 3], &Btrow[k + reg_cnt * 3], r3);
                }
                
                StoreReg(&Regs[reg_cnt * 0], r0);
                StoreReg(&Regs[reg_cnt * 1], r1);
                StoreReg(&Regs[reg_cnt * 2], r2);
                StoreReg(&Regs[reg_cnt * 3], r3);
                
                T sum1 = 0, sum2 = 0, sum3 = 0;
                for (size_t k = 0; k < Regs.size(); ++k)
                    sum1 += Regs[k];
                
                //for (size_t k = 0; k < A_cols - A_cols % block; ++k) sum3 += Arow[k] * Btrow[k];
                
                for (size_t k = k_hi; k < A_cols; ++k)
                    sum2 += Arow[k] * Btrow[k];
                
                C[i * A_rows + j] = sum2 + sum1;
            }
        }
    }
    
    #include <random>
    #include <thread>
    #include <chrono>
    #include <type_traits>
    
    template <typename T>
    void Test(T const * A, size_t A_rows, size_t A_cols, T const * B, size_t B_rows, size_t B_cols, T const * C, T eps) {
        for (size_t i = 0; i < A_rows / 16; ++i)
            for (size_t j = 0; j < B_cols / 16; ++j) {
                T sum = 0;
                for (size_t k = 0; k < A_cols; ++k)
                    sum += A[i * A_cols + k] * B[k * B_cols + j];
                ASSERT_MSG(std::abs(C[i * A_rows + j] - sum) <= eps * A_cols, "i " + std::to_string(i) + " j " + std::to_string(j) +
                    " C " + std::to_string(C[i * A_rows + j]) + " ref " + std::to_string(sum));
            }
    }
    
    double Time() {
        static auto const gtb = std::chrono::high_resolution_clock::now();
        return std::chrono::duration_cast<std::chrono::duration<double>>(
            std::chrono::high_resolution_clock::now() - gtb).count();
    }
    
    template <typename T>
    void Run() {
        size_t constexpr A_rows = 1000, A_cols = 1000, B_rows = 1000, B_cols = 1000;
        
        std::string const tname = std::is_same_v<T, float> ? "float" : std::is_same_v<T, double> ?
            "double" : std::is_same_v<T, int32_t> ? "int" : "<unknown>";
        bool const is_int = tname == "int";
        std::mt19937_64 rng{123};
        std::vector<T> A(A_rows * A_cols), B(B_rows * B_cols), C(A_rows * B_cols);
        for (size_t i = 0; i < A.size(); ++i)
            A[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
        for (size_t i = 0; i < B.size(); ++i)
            B[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
        auto tim = Time();
        Mul(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0]);
        tim = Time() - tim;
        std::cout << std::setw(6) << tname << ": time " << std::fixed << std::setprecision(4) << tim << " sec" << std::endl;
        Test(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0], tname == "float" ? T(1e-7) : tname == "double" ? T(1e-15) : T(0));
    }
    
    int main() {
        try {
            #if USE_OPENMP
                omp_set_num_threads(std::thread::hardware_concurrency());
            #endif
            Run<float>();
            Run<double>();
            Run<int32_t>();
            return 0;
        } catch (std::exception const & ex) {
            std::cout << "Exception: " << ex.what() << std::endl;
            return -1;
        }
    }
    

    Output:

     float: time 0.1569 sec
    double: time 0.3168 sec
       int: time 0.1565 sec