Tags: openmp, matrix-multiplication

Parallelizing matrix-vector multiplication and addition with OpenMP tasks


I would like to compute the following matrix-vector multiplication and adding operation as

y = (A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T)x

Can I use OpenMP task parallelism to accelerate this operation? The code is as follows:

    y.setZero();
    int ny = y.size();
    #pragma omp parallel shared(x)
    {
    #pragma omp single
        {
            VectorXd yA(ny), yC(ny), yCt(ny), yR(ny), yRt(ny), yD1(ny), yD1t(ny), yD2(ny), yD2t(ny);
            yA.setZero();
            yC.setZero();
            yCt.setZero();
            yR.setZero();
            yRt.setZero();
            yD1.setZero();
            yD1t.setZero();
            yD2.setZero();
            yD2t.setZero();

        #pragma omp task
            yA = A * x;
        #pragma omp task
            yC = C * x;
        #pragma omp task
            yCt = C.transpose() * x;
        #pragma omp task
            yR = R * x;
        #pragma omp task
            yRt = R.transpose() * x;
        #pragma omp task
            yD1 = D1 * x;
        #pragma omp task
            yD1t = D1.transpose() * x;
        #pragma omp task
            yD2 = D2 * x;
        #pragma omp task
            yD2t = D2.transpose() * x;

        #pragma omp taskwait
            y = yA + yC + yCt + yR + yRt + yD1 + yD1t + yD2 + yD2t;
        }
    }

This code is based on the Eigen library. Compared with the version without OpenMP, I cannot observe any noticeable acceleration on my machine, but I believe this operation should be parallelizable with OpenMP, hence this question. I would appreciate it if anybody could answer it.


Solution

  • > Can I use OpenMP task parallelism to accelerate this operation?

    Yes. That said, I can see three things to improve in your code.

    1) Data sharing - When starting with OpenMP, I recommend using default(none) together with explicit shared(...) or firstprivate(...) clauses on your tasks, so you understand how your variables are actually shared or copied.

    2) Scaling - In terms of performance, your code will not scale beyond 9 cores, since you only express 9 tasks.
    When doing algebra, you can taskify the computation through block operations.
    For instance, you can split A = B + C into 4 independent block operations (using Eigen's block interface):

    • A.block(0, 0, n/2, n/2) = B.block(0, 0, n/2, n/2) + C.block(0, 0, n/2, n/2)
    • A.block(0, n/2, n/2, n/2) = B.block(0, n/2, n/2, n/2) + C.block(0, n/2, n/2, n/2)
    • A.block(n/2, 0, n/2, n/2) = B.block(n/2, 0, n/2, n/2) + C.block(n/2, 0, n/2, n/2)
    • A.block(n/2, n/2, n/2, n/2) = B.block(n/2, n/2, n/2, n/2) + C.block(n/2, n/2, n/2, n/2)

    The block size here is n/2 and defines your task granularity.

    3) Synchronization - To ensure a correct order of execution, you used a #pragma omp taskwait, which is an explicit barrier that waits for the completion of previously spawned tasks.
    Since OpenMP 4.0, tasks can have dependences that guarantee the order of execution at a finer level.

    Putting it all together - I made a micro-benchmark with 3 versions of your problem - source

    • sequential - the computation is done on 1 core
    • taskwait - your version
    • block - uses Eigen block operations and OpenMP dependences.
      I split the computation into 2 operations and taskified each operation with m blocks:
      • add - M := A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T
      • mult - y := M.x
      The figure below shows the dependency task graph for m=2: the add(0, 0) task works on the top-left block, the add(1, 0) task works on the top-right block, and so on; the mult tasks depend on the add tasks working on the same row of blocks, and the norm task is a correctness check.

    (figure: dependency task graph of the block version for m = 2)

    I ran it on 16 cores, with n=4,096 and m=16 (number of blocks per row-wise operation). Here are my results:

    sequential version took 1.28707 s.
    taskwait version took 0.177616 s.(SUCCESS)
    task version took 0.128707 s.(SUCCESS)

    I ran it again, this time on 64 cores and with m=64:

    sequential version took 1.31316 s.
    taskwait version took 0.170817 s.(SUCCESS)
    task version took 0.0785489 s.(SUCCESS)

    The "block version" scales with cores, but the "taskwait" version does not.

    Here is the Gantt chart of the 1st execution (m=16). The big blue task corresponds to the sequential-version execution, the red ones to the taskwait-version execution. The purple and pink tasks correspond respectively to the add and mult tasks of the block-version.

    As you can see, the taskwait-version only uses 9 of the 16 cores, while the block-version can use every core.

    (figure: Gantt chart of the sequential, taskwait, and block versions on 16 cores)

    Here is a zoom on the block-version, with dependence rendering (arrows) enabled:

    (figure: zoomed Gantt chart of the block version with dependence arrows)

    The figures were generated using Multi-Processor-Computing (MPC) - which implements an OpenMP runtime and tracing tools.