I would like to compute the following matrix-vector multiplication and addition:
y = (A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T)x
Can I use task parallelism in OpenMP to accelerate this operation? The code is as follows:
y.setZero();
int ny = y.size();
#pragma omp parallel shared(x)
{
    #pragma omp single
    {
        VectorXd yA(ny), yC(ny), yCt(ny), yR(ny), yRt(ny), yD1(ny), yD1t(ny), yD2(ny), yD2t(ny);
        yA.setZero();
        yC.setZero();
        yCt.setZero();
        yR.setZero();
        yRt.setZero();
        yD1.setZero();
        yD1t.setZero();
        yD2.setZero();
        yD2t.setZero();
        #pragma omp task
        yA = A * x;
        #pragma omp task
        yC = C * x;
        #pragma omp task
        yCt = C.transpose() * x;
        #pragma omp task
        yR = R * x;
        #pragma omp task
        yRt = R.transpose() * x;
        #pragma omp task
        yD1 = D1 * x;
        #pragma omp task
        yD1t = D1.transpose() * x;
        #pragma omp task
        yD2 = D2 * x;
        #pragma omp task
        yD2t = D2.transpose() * x;
        #pragma omp taskwait
        y = yA + yC + yCt + yR + yRt + yD1 + yD1t + yD2 + yD2t;
    }
}
This code is based on the Eigen library. Compared with the version without OpenMP, I cannot observe any noticeable acceleration on my machine, but I think this operation can be parallelized with OpenMP, so I am asking this question. I would appreciate it if anybody could answer it.
> Can I use task parallelism in OpenMP to accelerate this operation?
Yes. However, in your code I can see three things to improve.
1) Data sharing -
When starting with OpenMP, I recommend using default(none), shared(...) or firstprivate(...) explicitly on your tasks, so you understand how your variables are actually captured. This matters here: yA, yC, ... are declared inside the single region, so each task captures them firstprivate by default, writes its result into a private copy, and that copy is discarded when the task ends.
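A minimal sketch of the fix, keeping the question's variable names and assuming A, C, x and the temporaries are plain variables visible in the enclosing scope:

#pragma omp task default(none) shared(yA, A, x)
yA = A * x;
#pragma omp task default(none) shared(yC, C, x)
yC = C * x;
#pragma omp task default(none) shared(yCt, C, x)
yCt = C.transpose() * x;
// ... same pattern for the six remaining tasks ...
#pragma omp taskwait
y = yA + yC + yCt + yR + yRt + yD1 + yD1t + yD2 + yD2t;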
2) Scaling -
In terms of performance, your code will not scale above 9 cores, since you only express 9 tasks.
When doing linear algebra, you can create more tasks through block operations. For instance, you can split A = B + C into 4 independent block operations (using Eigen's block interface), as sketched below. The block size here is n/2, and it defines the granularity of your tasks.
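A minimal sketch, assuming n-by-n MatrixXd objects A, B and C with n even; each task writes a disjoint block of A, so the four tasks are independent:

int h = n / 2; // block size: defines the task granularity
#pragma omp task shared(A, B, C)
A.topLeftCorner(h, h) = B.topLeftCorner(h, h) + C.topLeftCorner(h, h);
#pragma omp task shared(A, B, C)
A.topRightCorner(h, h) = B.topRightCorner(h, h) + C.topRightCorner(h, h);
#pragma omp task shared(A, B, C)
A.bottomLeftCorner(h, h) = B.bottomLeftCorner(h, h) + C.bottomLeftCorner(h, h);
#pragma omp task shared(A, B, C)
A.bottomRightCorner(h, h) = B.bottomRightCorner(h, h) + C.bottomRightCorner(h, h);
#pragma omp taskwait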
3) Synchronization -
To ensure a correct order of execution, you used a #pragma omp taskwait, which is an explicit synchronization point that waits for the completion of the previously spawned tasks. Since OpenMP 4.0, tasks can carry dependences (the depend clause) to guarantee the order of execution at a finer level.
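A minimal sketch, reusing the variable names from your code: the final sum itself becomes a task that only starts once its inputs are ready, with no taskwait needed.

#pragma omp task shared(yA, A, x) depend(out: yA)
yA = A * x;
#pragma omp task shared(yC, C, x) depend(out: yC)
yC = C * x;
// ... one depend(out: ...) task per remaining product ...
#pragma omp task shared(y, yA, yC) depend(in: yA, yC)
y = yA + yC; // + the other partial results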
Mixing things up - I made a micro-benchmark with 3 versions of your problem - source. In the block version, each matrix is partitioned into m x m blocks, and there are three kinds of tasks:
- add tasks compute M := A + C + C^T + R + R^T + D1 + D1^T + D2 + D2^T block by block. With m=2, the add(0, 0) task works on the top-left block, the add(1, 0) task works on the top-right block, etc.
- mult tasks compute y := M.x; each mult task depends on the add tasks working on the same line of blocks.
- a final norm task is a correctness check.
(A simplified sketch of this task structure is given at the end of this answer.) I ran it on 16 cores, with n=4,096 and m=16 (the number of blocks per line-wise operation). Here are my results:
sequential version took 1.28707 s.
taskwait version took 0.177616 s. (SUCCESS)
task version took 0.128707 s. (SUCCESS)
I ran it again, on 64 cores this time and with m=64:
sequential version took 1.31316 s.
taskwait version took 0.170817 s. (SUCCESS)
task version took 0.0785489 s. (SUCCESS)
The "block version" scales with cores, but the "taskwait" version does not.
Here is the Gantt chart of the 1st execution (m=16).
The big blue task corresponds to the sequential-version execution, and the red ones to the taskwait-version execution. The purple and pink tasks correspond respectively to the add and mult tasks of the block version.
As you can see, the taskwait version only uses 9 cores out of 16, while the block version can use every core.
Here is a zoom on the block version, with dependence rendering enabled (arrows).
The figures were generated using Multi-Processor Computing (MPC), which implements an OpenMP runtime and tracing tools.
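For reference, here is a simplified sketch of what the block version's task structure could look like. It is not the actual benchmark source: the dep array is a hypothetical helper used only as sentinels to express the dependences, and only two of the nine terms of M are shown.

int bs = n / m;               // block size, assuming n is divisible by m
std::vector<char> dep(m * m); // hypothetical sentinels: one per block of M
char *d = dep.data();
for (int i = 0; i < m; i++) {
    for (int j = 0; j < m; j++) {
        // add(i, j): build block (i, j) of M (only two terms shown)
        #pragma omp task shared(M, A, C) firstprivate(i, j, bs) depend(out: d[i * m + j])
        M.block(i * bs, j * bs, bs, bs) =
            A.block(i * bs, j * bs, bs, bs) + C.block(i * bs, j * bs, bs, bs); // + ...
    }
    // mult(i): depends on every add task of block row i, via an array section
    #pragma omp task shared(y, M, x) firstprivate(i, bs) depend(in: d[i * m : m])
    y.segment(i * bs, bs) = M.middleRows(i * bs, bs) * x;
}

Each add task produces one sentinel, and each mult task consumes the m sentinels of its block row, so a mult task becomes ready as soon as its own row of M is complete, without any global taskwait.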