
Eigen C++ running slower when compiled with MKL


I have recently started using Eigen (version 3.3.1). While benchmarking it against Armadillo on a simple matrix operation at the core of an OLS regression, namely computing the inverse of the product of a matrix's transpose with the matrix itself (XᵀX), I noticed that Eigen runs slower when compiled with the MKL library than without it for this kind of operation. I was wondering whether my compilation instructions were wrong. I also tried performing the same operation by calling MKL BLAS and LAPACK routines directly and got a much faster result, as fast as Armadillo. I cannot explain such poor performance, especially for the float type.

I wrote the code below to run this benchmark:

#define ARMA_DONT_USE_WRAPPER
#define ARMA_NO_DEBUG
#include <armadillo>

#define EIGEN_NO_DEBUG
#define EIGEN_NO_STATIC_ASSERT
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>

template <typename T>
using Matrix = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>;

#ifdef USE_FLOAT
using T = float;
#else
using T = double;
#endif

int main()
{
    arma::wall_clock timer;

    int niter = 1000000;
    int n = 1000;
    int k = 20;

    arma::Mat<T> Xa = arma::cumsum(arma::randn<arma::Mat<T>>(n, k));
    Matrix<T> Xe = Matrix<T>::Map(Xa.memptr(), Xa.n_rows, Xa.n_cols);

    // Armadillo compiled with MKL
    timer.tic();
    for (int i = 0; i < niter; ++i) {
        arma::Mat<T> iX2a = (Xa.t() * Xa).i();
    }
    std::cout << "...Elapsed time: " << timer.toc() << "\n";

    // Eigen compiled with MKL
    timer.tic();
    for (int i = 0; i < niter; ++i) {
        Matrix<T> iX2e = (Xe.transpose() * Xe).inverse();
    }
    std::cout << "...Elapsed time: " << timer.toc() << "\n";

    // Eigen Matrix with MKL routines
    timer.tic();
    for (int i = 0; i < niter; ++i) {
        Matrix<T> iX2e = Matrix<T>::Zero(k, k);
        // first stage => computing square matrix trans(X) * X
        #ifdef USE_FLOAT
        cblas_ssyrk(CblasColMajor, CblasLower, CblasTrans, k, n, 1.0, &Xe(0,0), n, 0.0, &iX2e(0,0), k);
        #else
        cblas_dsyrk(CblasColMajor, CblasLower, CblasTrans, k, n, 1.0, &Xe(0,0), n, 0.0, &iX2e(0,0), k);
        #endif
        // getting upper part  
        for (int i = 0; i < k; ++i)
            for (int j = i + 1; j < k; ++j)
                iX2e(i, j) = iX2e(j, i);
        // second stage => inverting square matrix
        // initializing pivots
        int* ipiv = new int[k];
        // factorizing matrix
        #ifdef USE_FLOAT 
        LAPACKE_sgetrf(LAPACK_COL_MAJOR, k, k, &iX2e(0,0), k, ipiv);
        #else
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, k, k, &iX2e(0,0), k, ipiv); 
        #endif
        // computing the matrix inverse
        #ifdef USE_FLOAT 
        LAPACKE_sgetri(LAPACK_COL_MAJOR, k, &iX2e(0,0), k, ipiv);
        #else
        LAPACKE_dgetri(LAPACK_COL_MAJOR, k, &iX2e(0,0), k, ipiv);
        #endif
        delete[] ipiv;
    }
    std::cout << "...Elapsed time: " << timer.toc() << "\n";
}

I compile this file called test.cpp with:

g++ -std=c++14 -Wall -O3 -march=native -DUSE_FLOAT test.cpp -o run -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_sequential -lmkl_core

I get the following results (on an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz):

  • for double type:

Armadillo with MKL => 64.0s

Eigen with MKL => 72.2s

Eigen alone => 68.7s

Pure MKL => 64.9s

  • for float type:

Armadillo with MKL => 38.2s

Eigen with MKL => 61.1s

Eigen alone => 42.6s

Pure MKL => 38.9s

NB: I am running this test for a project that will not use very large matrices; my biggest matrix will probably be 2000 rows by 25 columns, so I do not need parallelization at this level. Moreover, I will need to go parallel at a higher level, so I want to avoid any kind of nested parallelism that could slow down my code.


Solution

  • As I said in my comment, make sure to disable turbo-boost when benchmarking.

    As a side note and for future reference, your current Eigen code will call gemm instead of syrk. You can explicitly ask for the latter with:

    Matrix<T> tmp = Matrix<T>::Zero(k, k);
    tmp.selfadjointView<Eigen::Lower>().rankUpdate(Xe.transpose());
    tmp.triangularView<Eigen::Upper>() = tmp.transpose().triangularView<Eigen::Lower>();
    iX2e = tmp.inverse();
    

    For such small matrices I cannot really see much difference, though.