Suppose I have the following code, which contains a serial version of a computation and my attempted parallel version:
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
#include <omp.h>

double getSum(std::vector<double>& xv, std::vector<double>& yv);

int main()
{
    int nrows = 600; int ncols = 200; int Lvec = 10000;
    std::vector<std::vector<std::vector<double>>> vec3;
    std::vector<std::vector<double>> xarr(nrows, std::vector<double>(ncols, 0));
    srand(time(NULL));
    for (int k = 1; k <= nrows; k++) {
        //vec3 is a 3d vector, where each component stores
        //2d vectors of variable sizes determined by ncols1,
        //which ranges from 0-180
        int ncols1 = (rand() % 10) * 20;
        std::vector<std::vector<double>> vecCol;
        for (int j = 1; j <= ncols1; j++) {
            std::vector<double> vec0;
            for (int m = 1; m <= Lvec; m++) {
                vec0.push_back((rand() % 10) / 10.0);
            }
            vecCol.push_back(vec0);
        }
        vec3.push_back(vecCol);
    }

    //serial version
    std::vector<std::vector<double>> xarrSerial(nrows, std::vector<double>(ncols, 0));
    double xnow, xprev;
    double t0 = -omp_get_wtime();
    for (int k = 1; k <= nrows; k++) {
        std::vector<std::vector<double>> vecCol = vec3[k - 1];
        for (int j = 1; j <= vecCol.size(); j++) {
            if (j == 1) {
                xprev = 0;
                xarrSerial[k - 1][j - 1] = xprev;
            }
            else {
                xnow = getSum(vec3[k - 1][j - 2], vec3[k - 1][j - 1]);
                if (xnow > xprev) {
                    xarrSerial[k - 1][j - 1] = xnow;
                }
                else {
                    xarrSerial[k - 1][j - 1] = -1 * xnow;
                }
                xprev = xnow;
            }
        }
    }
    t0 += omp_get_wtime();

    //parallel version
    xprev = 0; xnow = 0;
    double t = -omp_get_wtime();
    #pragma omp parallel for
    for (int k = 1; k <= nrows; k++) {
        std::vector<std::vector<double>> vecCol = vec3[k - 1];
        for (int j = 1; j <= vecCol.size(); j++) {
            if (j == 1) {
                xprev = 0;
                xarr[k - 1][j - 1] = xprev;
            }
            else {
                //add vec3[k - 1][j - 2] and vec3[k - 1][j - 1]
                //then compare with xprev
                xnow = getSum(vec3[k - 1][j - 2], vec3[k - 1][j - 1]);
                if (xnow > xprev) {
                    xarr[k - 1][j - 1] = xnow;
                }
                else {
                    xarr[k - 1][j - 1] = -1 * xnow;
                }
                xprev = xnow;
            }
        }
    }
    t += omp_get_wtime();

    std::cout << "xarrSerial\n";
    for (int k = 1; k <= 10; k++) {
        for (int j = 1; j <= 4; j++) {
            std::cout << xarrSerial[k - 1][j - 1] << ", ";
        }
        std::cout << "\n";
    }
    std::cout << "xarr\n";
    for (int k = 1; k <= 10; k++) {
        for (int j = 1; j <= 4; j++) {
            std::cout << xarr[k - 1][j - 1] << ", ";
        }
        std::cout << "\n";
    }
    std::cout << "\n";
    std::cout << "t0: " << t0 << std::endl;
    std::cout << "t: " << t << std::endl;
    return 0;
}

double getSum(std::vector<double>& xv, std::vector<double>& yv)
{
    double out = 0.0;
    for (int i = 0; i < xv.size(); i++) {
        out = xv[i] * yv[i] + out;
    }
    return out;
}
For the parallel version, I can see that #pragma omp parallel for is not used correctly, because each computation depends on the result of the previous iteration, as can be seen in:

xnow = getSum(vec3[k - 1][j - 2], vec3[k - 1][j - 1]);
if (xnow > xprev) {
    xarr[k - 1][j - 1] = xnow;
}
else {
    xarr[k - 1][j - 1] = -1 * xnow;
}
xprev = xnow;
I confirmed that the parallel version does not produce correct results. Although the values come from rand(), in one example the output I got for xarrSerial (the serial version) was

0 2047.63 -2040.89 -2018.98
0 2004.31 2031.86 2058.08
...

and the parallel version, xarr, returned

0 2047.63 -2040.89 -2018.98
0 -2004.31 2031.86 2058.08
I want xarr to be the same as xarrSerial, but the negative sign in -2004.31 is clearly wrong. Also, the parallel version is not noticeably faster than the serial version: the serial version took 2.78 s and the parallel version took 2.54 s, even though my computer has 40 threads.
What is the correct way to use OpenMP to parallelize this? Or can this not be parallelized with OpenMP because of the if (xnow > xprev)?
TL;DR: protect the variables xprev and xnow with a firstprivate clause, or use local variables.
At first glance, the loops seem inherently sequential due to the dependency on xprev and xnow between iterations. But if we look carefully, there is actually no dependency between iterations of the outer loop, since xprev is re-initialized inside the inner loop (when j == 1). However, when you put a #pragma omp parallel for on a loop, you tell the compiler that there are no dependencies between the loop's iterations: the code writer is responsible for ensuring this so that OpenMP can generate correct code.
By default, variables declared outside the #pragma omp parallel for region are considered shared among the threads. You have to tell OpenMP explicitly that this is not the case here, using the clause firstprivate(xprev, xnow).
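Applied to the loop in the question, the only change needed is on the pragma itself (a minimal sketch, with the loop body left exactly as it is):

// Each thread gets its own copy of xprev and xnow, initialized
// from their values just before the parallel region (both 0 here).
#pragma omp parallel for firstprivate(xprev, xnow)
for (int k = 1; k <= nrows; k++) {
    // ... same body as in the question ...
}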
A programming best practice is to reduce the scope of variables: it improves the readability of the code and makes dependencies easier to track. So please move xprev and xnow into the loops.
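A sketch of the parallel loop rewritten this way (the ternary simply condenses the if/else from the question, and the per-row copy of vecCol is dropped; see the next point about references):

#pragma omp parallel for
for (int k = 1; k <= nrows; k++) {
    double xprev = 0;  // declared inside the loop, hence private to the thread running this iteration
    for (int j = 1; j <= (int)vec3[k - 1].size(); j++) {
        if (j == 1) {
            xarr[k - 1][j - 1] = xprev;
        }
        else {
            double xnow = getSum(vec3[k - 1][j - 2], vec3[k - 1][j - 1]);
            xarr[k - 1][j - 1] = (xnow > xprev) ? xnow : -xnow;
            xprev = xnow;
        }
    }
}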
The line std::vector<std::vector<double>> vecCol = vec3[k - 1]; slows down the code a lot because it involves a deep copy. Please use a reference to remove the slow copy: std::vector<std::vector<double>>& vecCol = vec3[k - 1];.
The initialization is slow due to the push_back calls. Please use reserve beforehand, or direct indexed access, since you already know the size of your vectors.
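For instance, a sketch of the innermost initialization loop with either approach:

std::vector<double> vec0;
vec0.reserve(Lvec);                      // one allocation instead of repeated growth
for (int m = 0; m < Lvec; m++)
    vec0.push_back((rand() % 10) / 10.0);

// ...or allocate at the final size and write directly:
std::vector<double> vec1(Lvec);
for (int m = 0; m < Lvec; m++)
    vec1[m] = (rand() % 10) / 10.0;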
Please do not use types like std::vector<std::vector<double>> for multidimensional arrays. This is not efficient because the data are not stored contiguously in memory. Prefer a single big flattened array/vector.
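As an illustration only, the rectangular result array xarr could be stored as one contiguous buffer with row-major indexing (the jagged vec3 would additionally need a per-row offset table, which is not shown here):

// Flat row-major storage for the nrows x ncols result array.
std::vector<double> xarrFlat((size_t)nrows * ncols, 0.0);

// Element (k - 1, j - 1) of the 2D version becomes:
xarrFlat[(size_t)(k - 1) * ncols + (j - 1)] = xnow;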
While the loop in getSum looks sequential due to the dependency on out, it is a reduction, so it can be vectorized. The compiler might not be able to vectorize it on its own; you can use a #pragma omp simd reduction(+:out) to speed up the loop.
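A sketch of getSum with this directive applied:

double getSum(std::vector<double>& xv, std::vector<double>& yv)
{
    double out = 0.0;
    const int n = (int)xv.size();
    // The reduction clause tells the compiler it may reorder the
    // additions, which allows the loop to be vectorized.
    #pragma omp simd reduction(+:out)
    for (int i = 0; i < n; i++) {
        out += xv[i] * yv[i];
    }
    return out;
}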
With all of this put together, the code is 11 times faster on my 4-core machine and gives correct results with respect to the sequential implementation (from 1.781 s for the initial sequential code to 0.160 s for the optimized parallel code).