No significant improvement when using a parallel block in OpenMP C++

I receive an array of Eigen::MatrixXf and an array of Eigen::Matrix4f in real time. Both arrays have the same number of elements. All I want to do is multiply the corresponding elements of the two arrays and store each result in a third array at the same index.

Please see the code snippet below:

#define COUNT 4

while (all_ok())
{
    Eigen::Matrix4f    trans[COUNT];
    Eigen::MatrixXf  in_data[COUNT];
    Eigen::MatrixXf out_data[COUNT];

    // at each iteration, new data is filled
    // in 'trans' and 'in_data' variables

    #pragma omp parallel num_threads(COUNT)
    {
        #pragma omp for
        for (int i = 0; i < COUNT; i++)
            out_data[i] = trans[i] * in_data[i];
    }
}

Please note that COUNT is a constant. Each element of trans is 4 x 4 and each element of in_data is 4 x n, where n is approximately 500,000. To parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of the for loop.

Any suggestions? Are there any alternative ways to perform the same operation?

Edit: My idea is to define 4 (= COUNT) threads, each of which takes care of one multiplication. That way, I guess, we don't need to create the threads every time.

Solution

It works for me with the following self-contained example; that is, I get a 4x speedup when enabling OpenMP:

#include <iostream>
#include <bench/BenchTimer.h>
using namespace Eigen;

const int COUNT = 4;

EIGEN_DONT_INLINE
void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
{
  #pragma omp parallel for num_threads(COUNT)
  for (int i = 0; i < COUNT; i++)
    out_data[i] = trans[i] * in_data[i];
}

int main()
{
  Eigen::Matrix4f    trans[COUNT];
  Eigen::MatrixXf  in_data[COUNT];
  Eigen::MatrixXf out_data[COUNT];
  int n = 500000;
  for (int i = 0; i < COUNT; i++)
  {
    trans[i].setRandom();
    in_data[i].setRandom(4,n);
    out_data[i].setRandom(4,n);
  }

  int tries = 3;
  int rep = 1;

  BenchTimer t;

  BENCH(t, tries, rep, foo(trans, in_data, out_data));

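  // Use the wall-clock timer for both the printed time and the GFlops figure
  // (t.best() without an argument defaults to the CPU timer).
  std::cout << " " << t.best(Eigen::REAL_TIMER) << " (" << double(n)*4.*4.*4.*2.e-9/t.best(Eigen::REAL_TIMER) << " GFlops)\n";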

  return 0;
}

So 1) make sure you measure wall-clock time and not CPU time, and 2) make sure that the products are the bottleneck and not the filling of in_data.
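To illustrate point 1, here is a minimal sketch (separate from the benchmark above) that times one call with both a wall clock and a CPU clock. The names scale_all, Timing and time_call are made up for illustration, and the loop is only a plain stand-in for the Eigen products. On POSIX systems std::clock() reports CPU time summed over all threads, so with OpenMP enabled the CPU figure can stay roughly flat while the wall-clock figure drops.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <vector>

// Hypothetical stand-in for the matrix products: out[i] = k * in[i].
// The pragma is simply ignored if OpenMP is not enabled at compile time.
void scale_all(const std::vector<float>& in, std::vector<float>& out, float k)
{
    assert(in.size() == out.size());
    const long n = static_cast<long>(in.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        out[i] = k * in[i];
}

// Elapsed time of one call, measured two ways.
struct Timing
{
    double wall_s; // wall-clock seconds (what the user actually waits for)
    double cpu_s;  // std::clock() seconds (CPU time on POSIX systems)
};

// Runs f() once and returns both measurements.
template <class F>
Timing time_call(F&& f)
{
    const auto w0 = std::chrono::steady_clock::now();
    const std::clock_t c0 = std::clock();
    f();
    const std::clock_t c1 = std::clock();
    const auto w1 = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(w1 - w0).count(),
             static_cast<double>(c1 - c0) / CLOCKS_PER_SEC };
}
```

Calling time_call([&]{ scale_all(in, out, 3.0f); }) and comparing wall_s with and without OpenMP shows whether the loop really got faster. Wrapping the code that fills the inputs in a second time_call gives a rough answer to point 2 as well.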

Finally, for maximal performance, don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with compiler optimizations turned on.
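As a rough sanity check of the build flags, the following sketch reports which of these features the compiler actually enabled. build_summary is a hypothetical helper. _OPENMP is defined by compilers when OpenMP is enabled, while __AVX__, __FMA__ and __OPTIMIZE__ are predefined macros on GCC and Clang, so other compilers may need different checks.

```cpp
#include <string>

// Returns one line per feature, based on macros the compiler predefines.
// Assumption: GCC/Clang-style macros; MSVC and others may differ.
std::string build_summary()
{
    std::string s;
#ifdef _OPENMP
    s += "OpenMP: on\n";        // e.g. compiled with -fopenmp
#else
    s += "OpenMP: off\n";       // '#pragma omp' lines are silently ignored
#endif
#ifdef __AVX__
    s += "AVX: on\n";           // e.g. -mavx or -march=native on an AVX CPU
#else
    s += "AVX: off\n";
#endif
#ifdef __FMA__
    s += "FMA: on\n";           // e.g. -mfma or -march=native on an FMA CPU
#else
    s += "FMA: off\n";
#endif
#ifdef __OPTIMIZE__
    s += "Optimization: on\n";  // any -O level above -O0
#else
    s += "Optimization: off\n";
#endif
    return s;
}
```

Printing build_summary() at the start of the benchmark makes it easy to spot a run that was accidentally built without OpenMP or without optimizations.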

For the record, on my computer the above example takes 0.25 s without OpenMP and 0.065 s with it.