c++ multithreading openmp eigen

No significant improvement when using a parallel block in OpenMP C++


I receive an array of Eigen::MatrixXf and an array of Eigen::Matrix4f in real time. Both arrays have the same number of elements. All I want to do is multiply the corresponding elements of the two arrays and store each result in a third array at the same index.

Please see the code snippet below:

#define COUNT 4

while (all_ok())
{
    Eigen::Matrix4f    trans[COUNT];
    Eigen::MatrixXf  in_data[COUNT];
    Eigen::MatrixXf out_data[COUNT];

    // at each iteration, new data is filled
    // in 'trans' and 'in_data' variables

    #pragma omp parallel num_threads(COUNT)
    {
        #pragma omp for
        for (int i = 0; i < COUNT; i++)
            out_data[i] = trans[i] * in_data[i];
    }
}

Please note that COUNT is a constant. Each element of trans is 4 x 4 and each element of in_data is 4 x n, where n is approximately 500,000. To parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of the for loop.

Any suggestions? Any alternatives to perform the same operation, please?

Edit: My idea is to define 4 (= COUNT) threads, each of which takes care of one multiplication. That way we don't need to create threads every time, I guess!


Solution

  • Works for me. With the following self-contained example, I get a roughly 4x speedup when enabling OpenMP:

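    #include <iostream>
    #include <Eigen/Dense>
    // BenchTimer.h lives in the bench/ directory of the Eigen source tree,
    // so the Eigen root directory must be on the include path.
    #include <bench/BenchTimer.h>
    using namespace Eigen;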
    
    const int COUNT = 4;
    
    EIGEN_DONT_INLINE
    void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
    {
      #pragma omp parallel for num_threads(COUNT)
      for (int i = 0; i < COUNT; i++)
        out_data[i] = trans[i] * in_data[i];
    }
    
    int main()
    {
      Eigen::Matrix4f    trans[COUNT];
      Eigen::MatrixXf  in_data[COUNT];
      Eigen::MatrixXf out_data[COUNT];
      int n = 500000;
      for (int i = 0; i < COUNT; i++)
      {
        trans[i].setRandom();
        in_data[i].setRandom(4,n);
        out_data[i].setRandom(4,n);
      }
    
      int tries = 3;
      int rep = 1;
    
      BenchTimer t;
    
      BENCH(t, tries, rep, foo(trans, in_data, out_data));
    
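      // Report time and throughput from the same wall-clock measurement.
      const double best_wall = t.best(Eigen::REAL_TIMER);
      std::cout << " " << best_wall << " (" << double(n)*4.*4.*4.*2.e-9/best_wall << " GFlops)\n";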
    
      return 0;
    }
    

    So 1) make sure you measure the wall-clock time and not the CPU time, and 2) make sure that the products are the bottleneck and not the filling of in_data.
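
    To illustrate point 1, here is a minimal sketch in plain standard C++ (no Eigen or OpenMP needed) that times a region with both a wall clock and the process CPU clock. The names `Timing`, `measure`, and `busy_work` are made up for this example and are not part of any library. On Linux and other POSIX systems `std::clock` counts processor time summed over all threads of the process, so for a well-parallelized region it can stay flat or even grow while the wall-clock time drops. On Windows `std::clock` behaves differently, so treat the CPU number as platform dependent.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <utility>

// Wall-clock and CPU seconds spent in one measured region.
struct Timing {
    double wall_seconds;
    double cpu_seconds;
};

// Hypothetical stand-in for a real workload: keeps the calling thread busy.
inline double busy_work(long iterations) {
    volatile double acc = 0.0;
    for (long i = 0; i < iterations; ++i)
        acc = acc + 1e-9 * static_cast<double>(i);
    return acc;
}

// Runs `work()` once and measures it with both clocks.
template <typename F>
Timing measure(F&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    std::forward<F>(work)();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing result;
    result.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    result.cpu_seconds =
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    assert(result.wall_seconds >= 0.0);  // steady_clock never goes backwards
    return result;
}
```

    Wrapping the fill step and the product loop in separate `measure` calls, e.g. `measure([&] { foo(trans, in_data, out_data); })`, should also show which of the two dominates (point 2).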

    Finally, for maximal performance don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with compiler optimizations turned on (e.g., -O2 or -O3).
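
    If in doubt whether the build options actually took effect, the compiler's predefined macros can be inspected. The sketch below assumes GCC or Clang: `_OPENMP` is defined by any compiler with OpenMP enabled, while `__AVX__`, `__FMA__`, and `__OPTIMIZE__` are compiler-specific and may be missing with other compilers. `build_report` is a hypothetical helper name. Note that when OpenMP is not enabled (e.g., -fopenmp is missing with GCC/Clang), the `#pragma omp` lines are ignored and the loop silently runs on a single thread.

```cpp
#include <cassert>
#include <string>

// Lists which relevant features were enabled when this file was compiled,
// based on predefined macros (GCC/Clang conventions assumed).
inline std::string build_report() {
    std::string report;
#ifdef _OPENMP
    report += "OpenMP: on (_OPENMP = " + std::to_string(_OPENMP) + ")\n";
#else
    report += "OpenMP: off (omp pragmas are ignored)\n";
#endif
#ifdef __AVX__
    report += "AVX: on\n";
#else
    report += "AVX: off\n";
#endif
#ifdef __FMA__
    report += "FMA: on\n";
#else
    report += "FMA: off\n";
#endif
#ifdef __OPTIMIZE__
    report += "Optimization: on\n";
#else
    report += "Optimization: off or not detectable\n";
#endif
    assert(!report.empty());
    return report;
}
```

    Printing `build_report()` at the start of `main` is a cheap sanity check before trusting any benchmark numbers.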

    For the record, on my computer the above example takes 0.25 s without OpenMP and 0.065 s with it (roughly a 3.8x speedup).