How to enable parallelization of sparse matrix/dense vector multiplication written in Eigen source code?

According to the Eigen documentation, as long as the proper compile flag is set and the OMP_NUM_THREADS=x is defined, all sparse matrix/dense vector multiplications will run in parallel, no matter where the multiplication takes place. After doing both, however, htop showed that only one core was in use the whole time.

I'm concerned with lines 58 and 98 of the following source code, where the sparse matrix/dense vector multiplications take place. One thing to note is that this code is part of Eigen's unsupported iterative solver module, but I don't think that is what prevents the parallelization.

https://eigen.tuxfamily.org/dox/unsupported/MINRES_8h_source.html

The platform is Xeon Gold 6126, and the compile flags I used are

CC=g++
FLAGS=-std=c++11 -m64 -O3 -fopenmp -march=skylake-avx512

I submit the job with the following script

#!/bin/bash

#something
#SBATCH -n 8
#something

OMP_NUM_THREADS=8 ./my_executable

which I assume sets up OpenMP properly.

I vaguely recall someone mentioning that, in order to take advantage of multiple cores, the sparse matrix has to be filled completely rather than just the upper or lower triangle. I did fill only the upper triangle, and I'm not sure whether this is the cause.

Any suggestions as to what I missed? Thanks in advance.

Solution

This is not correct:

as long as the proper compile flag is set and the OMP_NUM_THREADS=x is defined, all sparse matrix/dense vector multiplications will run in parallel

As described in the documentation, thread parallelization with OpenMP is available for row-major sparse * dense vector/matrix products.
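
To see why the storage order matters, here is a minimal sketch of a row-major (compressed-row) product in plain C++. It is not Eigen's implementation; the CsrMatrix struct and the csr_times_vector function are hypothetical names used only for illustration. Each output entry depends on a single row, so the row loop can be divided among threads without write conflicts. With column-major storage, every column contributes to many output entries, so the same simple split would have several threads updating the same elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal compressed-row (CSR) storage. All names are illustrative, not Eigen's.
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_start;  // size rows + 1; row i spans [row_start[i], row_start[i + 1])
    std::vector<std::size_t> col_index;  // column of each stored value
    std::vector<double> value;           // stored nonzeros, row by row
};

// Computes y = A * x. Iteration i writes only y[i], so rows can be shared
// among threads without synchronization. Without -fopenmp the pragma is
// ignored and the loop runs serially.
std::vector<double> csr_times_vector(const CsrMatrix& A, const std::vector<double>& x) {
    assert(x.size() == A.cols);
    assert(A.row_start.size() == A.rows + 1);
    std::vector<double> y(A.rows, 0.0);
    const long n = static_cast<long>(A.rows);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_start[i]; k < A.row_start[i + 1]; ++k) {
            sum += A.value[k] * x[A.col_index[k]];
        }
        y[i] = sum;
    }
    return y;
}
```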

The default storage order of a SparseMatrix in Eigen is column-major, to which the parallelization does not apply. For parallel matrix-vector products with OpenMP, a double-precision sparse matrix should be defined like this:

Eigen::SparseMatrix<double, Eigen::RowMajor>

BTW, it is not necessary to specify OMP_NUM_THREADS. By default, the thread count is set to the maximum number of available threads.
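
To confirm what the program actually sees at run time, a small check along these lines can help. This is only a sketch: available_openmp_threads is a hypothetical helper name, while omp_get_max_threads is the standard OpenMP runtime call. Eigen also exposes Eigen::nbThreads() and Eigen::setNbThreads(n) for querying and overriding the thread count it uses.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: upper bound on the threads an OpenMP parallel region
// would use. Returns 1 when the program was compiled without OpenMP support
// (for example, if -fopenmp was missing from the compile flags).
int available_openmp_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

If this returns 1 on a multi-core node, either the OpenMP flag did not reach the compiler or the job was granted a single CPU.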