Suppose I have the following code, which contains a serial version of a computation and my attempted parallel version:
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
#include <omp.h>

double getSum(std::vector<double>& xv, std::vector<double>& yv);

int main()
{
    int nrows = 600; int ncols = 200; int Lvec = 10000;
    std::vector<std::vector<std::vector<double>>> vec3;
    std::vector<std::vector<double>> xarr(nrows, std::vector<double>(ncols, 0));
    srand(time(NULL));
    for (int k = 1; k <= nrows; k++) {
        //vec3 is a 3d vector, where each component stores
        //2d vectors of variable sizes determined by ncols1,
        //which ranges from 0-180
        int ncols1 = (rand() % 10) * 20;
        std::vector<std::vector<double>> vecCol;
        for (int j = 1; j <= ncols1; j++) {
            std::vector<double> vec0;
            for (int m = 1; m <= Lvec; m++) {
                vec0.push_back((rand() % 10) / 10.0);
            }
            vecCol.push_back(vec0);
        }
        vec3.push_back(vecCol);
    }

    //serial version
    std::vector<std::vector<double>> xarrSerial(nrows, std::vector<double>(ncols, 0));
    double xnow, xprev;
    double t0 = -omp_get_wtime();
    for (int k = 1; k <= nrows; k++) {
        std::vector<std::vector<double>> vecCol = vec3[k - 1];
        for (int j = 1; j <= vecCol.size(); j++) {
            if (j == 1) {
                xprev = 0;
                xarrSerial[k - 1][j - 1] = xprev;
            }
            else {
                xnow = getSum(vec3[k - 1][j - 2], vec3[k - 1][j - 1]);
                if (xnow > xprev) {
                    xarrSerial[k - 1][j - 1] = xnow;
                }
                else {
                    xarrSerial[k - 1][j - 1] = -1 * xnow;
                }
                xprev = xnow;
            }
        }
    }
    t0 += omp_get_wtime();

    //parallel version
    xprev = 0; xnow = 0;
    double t = -omp_get_wtime();
    #pragma omp parallel for
    for (int k = 1; k <= nrows; k++) {
        std::vector<std::vector<double>> vecCol = vec3[k - 1];
        for (int j = 1; j <= vecCol.size(); j++) {
            if (j == 1) {
                xprev = 0;
                xarr[k - 1][j - 1] = xprev;
            }
            else {
                //add vec3[k - 1][j - 2] and vec3[k - 1][j - 1]
                //then compare with xprev
                xnow = getSum(vec3[k - 1][j - 2], vec3[k - 1][j - 1]);
                if (xnow > xprev) {
                    xarr[k - 1][j - 1] = xnow;
                }
                else {
                    xarr[k - 1][j - 1] = -1 * xnow;
                }
                xprev = xnow;
            }
        }
    }
    t += omp_get_wtime();

    std::cout << "xarrSerial\n";
    for (int k = 1; k <= 10; k++) {
        for (int j = 1; j <= 4; j++) {
            std::cout << xarrSerial[k - 1][j - 1] << ", ";
        }
        std::cout << "\n";
    }
    std::cout << "xarr\n";
    for (int k = 1; k <= 10; k++) {
        for (int j = 1; j <= 4; j++) {
            std::cout << xarr[k - 1][j - 1] << ", ";
        }
        std::cout << "\n";
    }
    std::cout << "\n";
    std::cout << "t0: " << t0 << std::endl;
    std::cout << "t: " << t << std::endl;
    return 0;
}

double getSum(std::vector<double>& xv, std::vector<double>& yv)
{
    double out = 0.0;
    for (int i = 0; i < xv.size(); i++) {
        out = xv[i] * yv[i] + out;
    }
    return out;
}
For the parallel version, I can see that #pragma omp parallel for is not used correctly, because each computation depends on the result of the previous iteration, as can be seen in:

xnow = getSum(vec3[k - 1][j - 2], vec3[k - 1][j - 1]);
if (xnow > xprev) {
    xarr[k - 1][j - 1] = xnow;
}
else {
    xarr[k - 1][j - 1] = -1 * xnow;
}
xprev = xnow;
I confirmed that the parallel version does not produce correct results. Although the values come from rand(), in one example the output I got for xarrSerial (the serial version) was

0 2047.63 -2040.89 -2018.98
0 2004.31 2031.86 2058.08
...

and the parallel version, xarr, returned

0 2047.63 -2040.89 -2018.98
0 -2004.31 2031.86 2058.08
I want xarr to be the same as xarrSerial, but the negative sign in -2004.31 is clearly wrong. Also, the parallel version is not noticeably faster than the serial version: the serial version took 2.78 s and the parallel version took 2.54 s, even though my computer has 40 threads.
What is the correct way to use OpenMP to parallelize this? Or can this not be parallelized with OpenMP because of the if (xnow > xprev)?
TL;DR: protect the variables xprev and xnow with a firstprivate clause, or use local variables.
At first glance, the loops seem inherently sequential due to the dependency on xprev and xnow between iterations. But if we look carefully, there is actually no dependency between iterations of the outer loop, since xprev is re-initialized inside the inner loop (when j == 1). However, when you put a #pragma omp parallel for on a loop, you tell the compiler that there are no dependencies between the loop's iterations: the code writer is responsible for ensuring this so that OpenMP can generate correct code.
By default, variables declared outside the #pragma omp parallel for region are considered shared among the threads. You have to tell OpenMP explicitly that this is not the case here, using the clause firstprivate(xprev, xnow).
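Applied to the loop in the question, the only change needed is on the pragma itself (a minimal sketch, with the loop body left exactly as it is):

// Each thread gets its own copy of xprev and xnow, initialized
// from their values just before the parallel region (both 0 here).
#pragma omp parallel for firstprivate(xprev, xnow)
for (int k = 1; k <= nrows; k++) {
    // ... same body as in the question ...
}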
A programming best practice is to reduce the scope of variables: it improves the readability of the code and makes dependencies easier to track. So please move xprev and xnow into the loops.
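A sketch of the parallel loop rewritten this way (the ternary simply condenses the if/else from the question, and the per-row copy of vecCol is dropped; see the next point about references):

#pragma omp parallel for
for (int k = 1; k <= nrows; k++) {
    double xprev = 0;  // declared inside the loop, hence private to the thread running this iteration
    for (int j = 1; j <= (int)vec3[k - 1].size(); j++) {
        if (j == 1) {
            xarr[k - 1][j - 1] = xprev;
        }
        else {
            double xnow = getSum(vec3[k - 1][j - 2], vec3[k - 1][j - 1]);
            xarr[k - 1][j - 1] = (xnow > xprev) ? xnow : -xnow;
            xprev = xnow;
        }
    }
}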
The line std::vector<std::vector<double>> vecCol = vec3[k - 1]; slows down the code a lot because it involves a deep copy. Please use a reference to remove the slow copy: std::vector<std::vector<double>>& vecCol = vec3[k - 1];.
The initialization is slow due to the push_back calls. Please use reserve beforehand, or direct indexed access, since you already know the size of your vectors.
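For instance, a sketch of the innermost initialization loop with either approach:

std::vector<double> vec0;
vec0.reserve(Lvec);                      // one allocation instead of repeated growth
for (int m = 0; m < Lvec; m++)
    vec0.push_back((rand() % 10) / 10.0);

// ...or allocate at the final size and write directly:
std::vector<double> vec1(Lvec);
for (int m = 0; m < Lvec; m++)
    vec1[m] = (rand() % 10) / 10.0;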
Please do not use types like std::vector<std::vector<double>> for multidimensional arrays. This is not efficient because the data are not stored contiguously in memory. Prefer a single big flattened array/vector.
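As an illustration only, the rectangular result array xarr could be stored as one contiguous buffer with row-major indexing (the jagged vec3 would additionally need a per-row offset table, which is not shown here):

// Flat row-major storage for the nrows x ncols result array.
std::vector<double> xarrFlat((size_t)nrows * ncols, 0.0);

// Element (k - 1, j - 1) of the 2D version becomes:
xarrFlat[(size_t)(k - 1) * ncols + (j - 1)] = xnow;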
While the loop in getSum looks sequential due to the dependency on out, it is a reduction, so it can be vectorized. The compiler might not be able to vectorize it on its own; you can use a #pragma omp simd reduction(+:out) to speed up the loop.
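A sketch of getSum with this directive applied:

double getSum(std::vector<double>& xv, std::vector<double>& yv)
{
    double out = 0.0;
    const int n = (int)xv.size();
    // The reduction clause tells the compiler it may reorder the
    // additions, which allows the loop to be vectorized.
    #pragma omp simd reduction(+:out)
    for (int i = 0; i < n; i++) {
        out += xv[i] * yv[i];
    }
    return out;
}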
With all of this put together, the code is 11 times faster on my 4-core machine and gives correct results with respect to the sequential implementation (from 1.781 s for the initial sequential code to 0.160 s for the optimized parallel code).