Vector-Matrix multiplication with Eigen

I want to use the library Eigen to do linear algebra calculations. In particular, I want to multiply a random vector by a random matrix. Here is the code I am using:

#include <iostream>
#include <chrono>
#include <Eigen/Dense>

using namespace Eigen;

int main(){

    Eigen::initParallel();
    Matrix<unsigned int,Dynamic,Dynamic> A; A = Matrix<unsigned int,500,15500>::Random();
    Matrix<unsigned int,Dynamic, Dynamic> s; s= Matrix<unsigned int,1,500>::Random();
    Matrix<unsigned int,Dynamic,Dynamic> b;

    auto t1 = std::chrono::high_resolution_clock::now();

    b=s*A;

    auto t2 = std::chrono::high_resolution_clock::now();
    auto timeMult = std::chrono::duration_cast <std::chrono::microseconds>(t2 - t1).count();
    std::cout << "Result size: " << b.rows() << "x" << b.cols() << std::endl;
    std::cout << "Time for multiplication: " << timeMult << " microseconds" << std::endl;

    return 0;
}

Then, to compile it I do

g++ -I. -Wall -std=c++0x -fopenmp main.cpp

I believe everything works fine (I did not check the actual result) but it seems really slow. To give an idea, I wrote a C++ code that does exactly the same thing and explicitly uses threads, which runs about 54 times faster that the code I pasted above! In particular, on my machine it is 286904 microseconds against 5300 microseconds with my C++ code.

Why is it so slow and how to make it faster?

I am not posting the code I wrote because it is a piece of a much larger software and making a MWE out of it would require a lot of work. Instead, I am going to describe what it does: I defined classes for vectors and matrices which wrap std::vectors, then to do the multiplication I define a certain number of threads, split the matrix in chunks and have each thread calculate the linear combination of the rows according to the coefficients in the vector. Each thread writes its partial result in another row vector, and finally all the vectors are summed together to obtain the final result. Very simple. By the way, I am using 4 threads, even though this value may be optimized.

Solution

Additionally to adding -O2 or -O3 to your compile flags (as pointed out in the comments), you should change the type of s and b to Matrix<unsigned int,1,Dynamic>. If Eigen knows at compile time that one of the factors of a product is a vector, it can use a much faster product implementation. On my machine that changed the execution time from 25392 µs to 4751 µs.

However, you will not benefit from multithreading for matrix-vector products at the moment (Eigen 3.3rc1).