
Unexpected and large runtime variations in Eigen for matrix multiplies


I am comparing ways to perform equivalent matrix operations within Eigen, and am getting extraordinarily different runtimes, including some non-intuitive results.

I am comparing three mathematically equivalent forms of the matrix multiplication:

wx * transpose(data)

The three forms I'm comparing are:

  1. result = wx * data.transpose() (straight multiply version)
  2. result.noalias() = wx * data.transpose() (noalias version)
  3. result = (data * wx.transpose()).transpose() (transposed version)

I am also testing both column-major and row-major storage.

With column-major storage, the transposed version is significantly faster (by an order of magnitude) than both the straight multiply and the noalias versions, which have approximately equal runtimes.

With row-major storage, the noalias and transposed versions are both significantly faster than the straight multiply.

I understand that Eigen uses lazy evaluation, and that the immediate result of an operation is often an expression template rather than a computed value. I also understand that a matrix * matrix product on the right-hand side of an assignment is always evaluated into a temporary, to avoid aliasing issues, which is why I am attempting to speed things up with noalias().

My main questions:

  1. Why is the transposed version always significantly faster, even when (in the column-major case) I explicitly use noalias() so that no temporary is created?

  2. Why does the (significant) runtime difference between the straight multiply and the noalias version only occur with row-major storage?

The code I am using for this is below. It is compiled with gcc 4.9.2 on a CentOS 6 install, using the following command line.

g++ eigen_test.cpp -O3 -std=c++11 -o eigen_test -pthread -fopenmp -finline-functions

#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main()
{
  using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
  // using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
  int wx_rows = 8000;
  int wx_cols = 1000;
  int samples = 1;
  Matrix wx = Eigen::MatrixXf::Random(wx_rows, wx_cols);
  Matrix data = Eigen::MatrixXf::Random(samples, wx_cols);

  Matrix result;

  unsigned int iterations = 10000;
  float sum = 0;
  auto before = std::chrono::high_resolution_clock::now();
  for (unsigned int ii = 0; ii < iterations; ++ii)
  {
    result = wx * data.transpose();
    sum += result(result.rows() - 1, result.cols() - 1);
  }
  auto after = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
  std::cout << "original sum:       " << sum << std::endl;
  std::cout << "original time (ms): " << duration << std::endl;
  std::cout << std::endl;

  sum = 0;
  before = std::chrono::high_resolution_clock::now();
  for (unsigned int ii = 0; ii < iterations; ++ii)
  {
    result.noalias() = wx * data.transpose();
    sum += result(wx_rows - 1, samples - 1);
  }
  after = std::chrono::high_resolution_clock::now();
  duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
  std::cout << "noalias sum:        " << sum << std::endl;
  std::cout << "noalias time (ms):  " << duration << std::endl;
  std::cout << std::endl;

  sum = 0;
  before = std::chrono::high_resolution_clock::now();
  for (unsigned int ii = 0; ii < iterations; ++ii)
  {
    result = (data * wx.transpose()).transpose();
    sum += result(wx_rows - 1, samples - 1);
  }
  after = std::chrono::high_resolution_clock::now();
  duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
  std::cout << "transposed sum:       " << sum << std::endl;
  std::cout << "transposed time (ms): " << duration << std::endl;

  return 0;
}

Solution

  • One half of the explanation is that, in the current version of Eigen, multi-threading is achieved by splitting the work over blocks of columns of the result (and of the right-hand side). With only one column, multi-threading cannot take place. In the column-major case, this explains why cases 1 and 2 underperform. Case 3, on the other hand, is evaluated as:

    column_major_tmp.noalias() = data * wx.transpose();
    result = column_major_tmp.transpose();
    

    and since wx.transpose().cols() is huge, multi-threading is effective.

    To understand the row-major case, you also need to know that, internally, the matrix product is implemented for a column-major destination. If the destination is row-major, as in case 2, then the product is transposed, so what really happens is:

    row_major_result.transpose().noalias() = data * wx.transpose();
    

    and so we're back to case 3, but without the temporary.

    This is clearly a limitation of Eigen's current multi-threading implementation for highly unbalanced matrix sizes. Ideally, threads should be spread over row-blocks and/or column-blocks depending on the sizes of the matrices at hand.

    BTW, you should also compile with -march=native to let Eigen fully exploit your CPU (AVX, FMA, AVX-512, ...).
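    Applied to the command line from the question, that suggestion would look like (assuming the same file and flags):

```shell
g++ eigen_test.cpp -O3 -march=native -std=c++11 -o eigen_test -pthread -fopenmp -finline-functions
```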