
OpenMP parallelize for construct performance


First method (parallelize inner loop):

for(j=0; j<LATTICE_VW; ++j) {
    x = j*DX + LATTICE_W;
    #pragma omp parallel for ordered private(y, prob)
        for(i=0; i<LATTICE_VH; ++i) {
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
}

Second method (parallelize outer loop):

#pragma omp parallel for ordered private(x, y, prob)
    for(j=0; j<LATTICE_VW; ++j) {
        x = j*DX + LATTICE_W;
        for(i=0; i<LATTICE_VH; ++i) {
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
    }

Third method (parallelize collapsed loops):

#pragma omp parallel for collapse(2) ordered private(x, y, prob)
    for(j=0; j<LATTICE_VW; ++j) {
        for(i=0; i<LATTICE_VH; ++i) {
            x = j*DX + LATTICE_W;
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
    }

If I had to guess, I would say that method 3 should be the fastest.

However, method 1 is the fastest, while the second and third take about the same amount of time as if there were no parallelization at all. Why does this happen?


Solution

  • Look at this:

    for(int x = 0; x < 4; ++x)
      #pragma omp parallel for ordered
      for(int y = 0; y < 4; ++y)
        #pragma omp ordered
        cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;
    

    you get:

    0,0 (by thread 0)
    0,1 (by thread 1)
    0,2 (by thread 2)
    0,3 (by thread 3)
    1,0 (by thread 0)
    1,1 (by thread 1)
    1,2 (by thread 2)
    1,3 (by thread 3)
    

    Each thread only has to wait briefly at the cout; all the work before it can be done in parallel. But with:

    #pragma omp parallel for ordered
    for(int x = 0; x < 4; ++x)
      for(int y = 0; y < 4; ++y)
        #pragma omp ordered
        cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;
    

    and

    #pragma omp parallel for collapse(2) ordered
    for(int x = 0; x < 4; ++x)
      for(int y = 0; y < 4; ++y)
        #pragma omp ordered
        cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;
    

    the situation is:

    0,0 (by thread 0)
    0,1 (by thread 0)
    0,2 (by thread 0)
    0,3 (by thread 0)
    1,0 (by thread 1)
    1,1 (by thread 1)
    1,2 (by thread 1)
    1,3 (by thread 1)
    2,0 (by thread 2)
    2,1 (by thread 2)
    2,2 (by thread 2)
    2,3 (by thread 2)
    3,0 (by thread 3)
    3,1 (by thread 3)
    3,2 (by thread 3)
    3,3 (by thread 3)
    

    So thread 1 has to wait for thread 0 to finish all its work before it can cout for the first time, and nearly nothing can be done in parallel.

    Try adding schedule(static, 1) to the collapse version, and it should perform at least as well as the first version: with a chunk size of 1, consecutive iterations are dealt out to the threads round-robin, so the ordered output rotates through the threads just as in the first version.
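    A minimal sketch of that suggestion (the 4x4 grid and the ostringstream buffer are illustrative assumptions, not taken from the original code):

    #include <iostream>
    #include <sstream>

    int main() {
        std::ostringstream out;   // shared buffer; written only inside ordered

        // collapse(2) fuses both loops into one 16-iteration space;
        // schedule(static, 1) deals iterations to threads round-robin,
        // so consecutive ordered iterations belong to different threads
        // and the work between ordered regions can overlap.
        #pragma omp parallel for collapse(2) ordered schedule(static, 1)
        for (int x = 0; x < 4; ++x)
            for (int y = 0; y < 4; ++y) {
                // per-cell work would go here and run in parallel
                #pragma omp ordered
                out << x << ',' << y << '\n';  // serialized, in iteration order
            }

        std::cout << out.str();
        return 0;
    }

    Whether compiled with or without -fopenmp, the ordered clause guarantees the lines appear in iteration order (0,0 through 3,3); only the throughput changes.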