I am trying to speed up matrix multiplication so it performs much better than the naive implementation. My goal is to make it 150 times faster than the naive version. Here are the things I have tried in my implementation so far:
After steps 1 and 2, my matmul became 4 times faster than the naive implementation. After adding SIMD it became 17 times faster, and after adding OpenMP it became 56 times faster. I was expecting a much larger gain from OpenMP, at least another 6 to 8 times boost. What could I be missing? My code roughly looks like this:
#pragma omp parallel for collapse(2)
for (int i = 0; i < result.rows; i += 1) {
    for (int j = 0; j < result.cols; j += 1) {
        double product = 0.0;
        for (int k = 0; k < matrix1.cols / 4 * 4; k += 4) {
            // Use _mm256_mul_pd and _mm256_add_pd to process 4 elements at a time.
        }
        // Horizontally sum the vector accumulator (via _mm256_store_pd) into product.
        for (int k = matrix1.cols / 4 * 4; k < matrix1.cols; k += 1) {
            // Tail case: accumulate the remaining elements into product.
        }
        result.mat[i][j] = product;
    }
}
I want to reach at least a 100x speedup over the naive version, i.e. roughly double my current result. How else should I optimize my code?
Parallelism can only give you so much; moreover, the more the sequential code is already optimized, the smaller the relative gains from parallelism you are likely to see. Nevertheless, one thing you can still improve -- I have done it in the past and it helped a lot -- is to divide the matrix multiplication into smaller blocks, i.e. sub-divide it into multiplications of smaller matrices (tiles).

By performing the multiplication on smaller sub-matrices you improve cache usage, both temporal locality and spatial locality. You should choose the tile sizes according to the sizes of the cache levels (e.g., L1, L2, and L3) of the architecture you are using. You can look at cache blocking and matrix multiplication in more detail in these slides. What Every Programmer Should Know About Memory also has a vectorized cache-blocked matmul in an appendix.
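A minimal sketch of the idea (the tile size BS here is an arbitrary placeholder; in practice you tune it so the working set of three tiles fits in the target cache level, and n is assumed to be a multiple of BS for brevity):

#include <stddef.h>

#define BS 64  /* placeholder tile size; tune so the tiles fit in L1/L2 */

/* C += A * B, all n x n, row-major; n assumed to be a multiple of BS. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* Multiply the (ii,kk) tile of A by the (kk,jj) tile of B
                   into the (ii,jj) tile of C; the tiles stay cache-resident. */
                for (size_t i = ii; i < ii + BS; ++i)
                    for (size_t k = kk; k < kk + BS; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

The i-k-j order inside each tile keeps the innermost accesses to B and C contiguous, so it combines naturally with your existing AVX code and the OpenMP pragma on the outer tile loop.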
For example, if you have an L3 cache, which is shared among cores, you can load multiple columns of matrix B into the L3 cache and reuse those values while multiplying smaller tiles that fit in the L1 and L2 caches. You can go even further and sub-divide those tiles into even smaller blocks so that you can take advantage of the registers.
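As a sketch of the register-blocking idea, here is a hypothetical 4x4 micro-kernel (the function name, arguments, and strides are mine, not from your code) that keeps a 4x4 block of C in four __m256d accumulators for the whole k loop instead of reloading it from memory:

#include <immintrin.h>
#include <stddef.h>

/* C[0..3][0..3] += A[0..3][0..K-1] * B[0..K-1][0..3], row-major,
   with leading dimensions lda, ldb, ldc. */
static void kernel_4x4(size_t K, const double *A, size_t lda,
                       const double *B, size_t ldb,
                       double *C, size_t ldc)
{
    __m256d c0 = _mm256_loadu_pd(C + 0 * ldc);
    __m256d c1 = _mm256_loadu_pd(C + 1 * ldc);
    __m256d c2 = _mm256_loadu_pd(C + 2 * ldc);
    __m256d c3 = _mm256_loadu_pd(C + 3 * ldc);

    for (size_t k = 0; k < K; ++k) {
        __m256d b = _mm256_loadu_pd(B + k * ldb);  /* B[k][0..3] */
        c0 = _mm256_add_pd(c0, _mm256_mul_pd(_mm256_set1_pd(A[0 * lda + k]), b));
        c1 = _mm256_add_pd(c1, _mm256_mul_pd(_mm256_set1_pd(A[1 * lda + k]), b));
        c2 = _mm256_add_pd(c2, _mm256_mul_pd(_mm256_set1_pd(A[2 * lda + k]), b));
        c3 = _mm256_add_pd(c3, _mm256_mul_pd(_mm256_set1_pd(A[3 * lda + k]), b));
    }

    _mm256_storeu_pd(C + 0 * ldc, c0);
    _mm256_storeu_pd(C + 1 * ldc, c1);
    _mm256_storeu_pd(C + 2 * ldc, c2);
    _mm256_storeu_pd(C + 3 * ldc, c3);
}

If your CPU supports FMA you can replace the add/mul pair with _mm256_fmadd_pd; either way the point is that the accumulators never leave the registers during the k loop.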
After having optimized the memory usage of the matrix multiplication, you can try to get that extra speedup from multithreading. If you have a cluster of multi-core nodes you can try to parallelize with MPI + OpenMP; of course, at that point you will run into another bottleneck: communication among processes.
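A minimal hybrid sketch, assuming square matrices whose row count is divisible by the number of ranks (the size n is a placeholder, the init values are dummies, and the inner triple loop stands in for your OpenMP/AVX kernel):

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 2048;              /* placeholder size, assumed n % nprocs == 0 */
    const int rows = n / nprocs;     /* rows of A and C owned by this rank */

    double *A = NULL, *C = NULL;
    double *B  = malloc((size_t)n * n * sizeof *B);
    double *Al = malloc((size_t)rows * n * sizeof *Al);
    double *Cl = calloc((size_t)rows * n, sizeof *Cl);

    if (rank == 0) {                 /* root owns the full matrices */
        A = malloc((size_t)n * n * sizeof *A);
        C = malloc((size_t)n * n * sizeof *C);
        for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }
    }

    /* Distribute row blocks of A, replicate B on every rank. */
    MPI_Scatter(A, rows * n, MPI_DOUBLE, Al, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank multiplies its slice with its own OpenMP threads
       (replace this loop with the blocked/SIMD kernel). */
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                Cl[i * n + j] += Al[i * n + k] * B[k * n + j];

    MPI_Gather(Cl, rows * n, MPI_DOUBLE, C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* frees omitted for brevity */
    MPI_Finalize();
    return 0;
}

Note that broadcasting all of B to every rank is the simplest layout but also where the communication cost shows up; more scalable schemes (e.g., 2D block distributions) exist once that becomes the bottleneck.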
This all depends on the architecture your code is running on: if you have a NUMA architecture, you also have to factor in things like local and non-local memory. You can also explore the GPU route: Matrix-Matrix Multiplication on the GPU with Nvidia CUDA.
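On the NUMA side, one common trick is first-touch initialization: pages are usually placed on the node of the thread that first writes them, so initialize the matrices with the same OpenMP schedule the compute loops will use. A minimal sketch, assuming a Linux-style first-touch policy and flat row-major arrays:

#include <omp.h>

/* Write each page from the thread that will later compute on it, so the
   data ends up in that thread's local NUMA memory. */
void first_touch_init(int n, double *A, double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            A[i * n + j] = 0.0;
            B[i * n + j] = 0.0;
            C[i * n + j] = 0.0;
        }
}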
Have a look at BLAS to get a good insight into efficient code.
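For reference, this is roughly how you would call a tuned BLAS (e.g., OpenBLAS, MKL, or BLIS) through the CBLAS interface; comparing your kernel against dgemm gives you a realistic upper bound for what the hardware can do:

#include <cblas.h>

/* C = 1.0 * A * B + 0.0 * C, all n x n, row-major. */
void matmul_blas(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                     B, n,
                0.0, C, n);
}

If a library dgemm is, say, 200x faster than the naive loop on your machine, you know a 100x-150x target is reachable with blocking, register tiling, and threading combined.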