Tags: c, parallel-processing, mpi, openmp, distributed-computing

No speedup with OpenMP and MPI over MPI only


I have read all the relevant questions I could find, but I still could not find a solution to my issue: a function with a double for loop is the bottleneck of my program.

The code is structured around MPI as follows:

  1. A big matrix is scattered among p processes.
  2. Every process now holds a submatrix.
  3. Every process calls update() in a loop.
  4. When the loop terminates, the master process gathers the results (a simplified sketch of this structure follows).
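
In outline (a minimal sketch, not my actual code: it uses a 1D row decomposition for brevity, whereas the real code distributes pSqrt x pSqrt blocks, and N and all other names are placeholders):

#include <mpi.h>
#include <stdlib.h>

#define N 1024                            /* placeholder global matrix dimension */

int main(int argc, char **argv)
{
        int id, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int rows = N / p;                 /* rows per process, assuming N % p == 0 */
        float *big = NULL;
        if (id == 0)                      /* only the master owns the full matrix */
                big = malloc((size_t)N * N * sizeof *big);
        float *sub = malloc((size_t)rows * N * sizeof *sub);

        /* steps 1-2: scatter the big matrix so every process gets a submatrix */
        MPI_Scatter(big, rows * N, MPI_FLOAT, sub, rows * N, MPI_FLOAT,
                    0, MPI_COMM_WORLD);

        /* step 3: every process calls update() in a loop
           (halo exchanges between neighbours omitted here) */

        /* step 4: the master gathers the results */
        MPI_Gather(sub, rows * N, MPI_FLOAT, big, rows * N, MPI_FLOAT,
                   0, MPI_COMM_WORLD);

        free(sub);
        free(big);                        /* free(NULL) is a no-op on non-master ranks */
        MPI_Finalize();
        return 0;
}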

Now I would like to augment my MPI code with OpenMP to get faster execution, by taking advantage of the double for loop in update().
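
As far as I understand, the only MPI-side change needed for the hybrid version is to request thread support at initialisation (a sketch; it assumes argc/argv are in scope and that only the thread that called MPI_Init_thread ever makes MPI calls):

int provided;
/* MPI_THREAD_FUNNELED: threads exist, but only the master thread talks to MPI */
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if (provided < MPI_THREAD_FUNNELED) {
        /* the library cannot give the requested level: fall back to pure MPI */
}

The OpenMP directives go into update(), which currently looks like this: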

void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
        int i = 1, j = 1, end_i = subN - 1, end_j = subN - 1;
        if ( id / pSqrt == 0) {
                i = 2;
                end_i = subN - 1;
        } else if ( id / pSqrt == (pSqrt - 1) ) {
                i = 1;
                end_i = subN - 2;
        }
        #pragma omp parallel for
        for ( ; i < end_i; ++i) {
                if (id % pSqrt == 0) {
                        j = 2;
                        end_j = subN - 1; 
                } else if ((id + 1) % pSqrt == 0) {
                        j = 1;
                        end_j = subN - 2;
                }
                #pragma omp parallel for
                for ( ; j < end_j; ++j) {
                        gridNextPtr[i][j] = gridPtr[i][j]  +
                          parms.cx * (gridPtr[i+1][j] +
                          gridPtr[i-1][j] -
                          2.0 * gridPtr[i][j]) +
                          parms.cy * (gridPtr[i][j+1] +
                          gridPtr[i][j-1] -
                          2.0 * gridPtr[i][j]);
                }
        }
}

I am running this on 2 computers, each with 2 CPUs, using 4 processes. However, I see no speedup between running with and without OpenMP. Any ideas, please? I am compiling with the -O1 optimization flag.


Solution

  • What about this version (not tested)?

    Please compile it and test it. If it works better, I'll explain it more. BTW, using some more aggressive compiler options might help too.

    void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
    {
        int beg_i = 1, beg_j = 1;
        int end_i = subN - 1, end_j = subN - 1;
        if ( id / pSqrt == 0 ) {
            beg_i = 2;
        } else if ( id / pSqrt == (pSqrt - 1) ) {
            end_i = subN - 2;
        }
        if (id % pSqrt == 0) {
            beg_j = 2;
        } else if ((id + 1) % pSqrt == 0) {
            end_j = subN - 2;
        }
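        /* one parallel region only: the i iterations are split across the threads */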
        #pragma omp parallel for schedule(static)
        for ( int i = beg_i; i < end_i; ++i ) {
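            /* the j loop is in canonical form, letting the compiler vectorise it */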
            #pragma omp simd
            for ( int j = beg_j; j < end_j; ++j ) {
                gridNextPtr[i][j] = gridPtr[i][j] +
                    parms.cx * (gridPtr[i+1][j] +
                            gridPtr[i-1][j] -
                            2.0 * gridPtr[i][j]) +
                    parms.cy * (gridPtr[i][j+1] +
                            gridPtr[i][j-1] -
                            2.0 * gridPtr[i][j]);
            }
        }
    }
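
    Before comparing timings with the version above, it is also worth checking that OpenMP is actually active and that each rank really gets more than one thread (otherwise no speedup can be expected from the directives). A small sanity check, to be called once after MPI is initialised (a sketch, not part of the original code):

    #include <stdio.h>
    #include <mpi.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    /* Print the thread count seen by each MPI rank; one thread per rank
       (or a missing _OPENMP macro) means the pragmas cannot help. */
    static void report_threads(void)
    {
        int id;
        MPI_Comm_rank(MPI_COMM_WORLD, &id);
    #ifdef _OPENMP
        printf("rank %d: OpenMP enabled, max threads = %d\n",
               id, omp_get_max_threads());
    #else
        printf("rank %d: compiled without OpenMP support\n", id);
    #endif
    }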
    

    EDIT: Some explanations on what I did to the code...

    1. The initial version was using nested parallelism for no reason at all (a parallel region nested within another parallel region). This was likely to be very counterproductive and I simply removed it (see the small demo after this list).
    2. The loop indexes i and j were declared and initialised outside of the for loop statements. This is error-prone at two levels: 1/ it might force you to declare their parallel scope (private), whereas having them within the for statement automatically gives them the right one; and 2/ you can get mix-ups by erroneously reusing the indexes outside of the loops. Moving them into the for statements was easy.
    3. You were changing the boundaries of the j loop inside the parallel region for no good reason. You would have had to declare end_j private. Moreover, it was a potential limitation for further developments (like the potential use of the collapse(2) directive), as it broke the rules for a canonical loop form as defined in the OpenMP standard. So defining beg_i and beg_j outside of the parallel region makes sense, sparing computation and simplifying the form of the loops, keeping them canonical.
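
    To see point 1 in action, here is a tiny standalone demo (unrelated to the grid code, just an illustration) that prints the team sizes of nested parallel regions; with nested parallelism disabled, which is the default in most implementations, every inner region runs with a single thread and only adds overhead:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel num_threads(2)
        {
            /* outer team: normally 2 threads here */
            #pragma omp parallel num_threads(2)
            {
                /* inner team: 1 thread per outer thread unless nesting is enabled,
                   so the inner "parallel for" in the original code gained nothing */
                printf("level %d: thread %d of %d\n",
                       omp_get_level(), omp_get_thread_num(), omp_get_num_threads());
            }
        }
        return 0;
    }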

    With these changes in place, the code became suitable for vectorisation, and the addition of a simple simd directive on the j loop enforces it, should the compiler fail to see the vectorisation opportunity by itself.
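
    For instance, since both loops are now canonical and perfectly nested, the loop nest in the function above could also be collapsed into a single parallel loop (an untested sketch; whether it helps depends on subN and the thread count):

    #pragma omp parallel for collapse(2) schedule(static)
    for ( int i = beg_i; i < end_i; ++i ) {
        for ( int j = beg_j; j < end_j; ++j ) {
            /* same stencil body as in the function above */
            gridNextPtr[i][j] = gridPtr[i][j] +
                parms.cx * (gridPtr[i+1][j] + gridPtr[i-1][j] - 2.0 * gridPtr[i][j]) +
                parms.cy * (gridPtr[i][j+1] + gridPtr[i][j-1] - 2.0 * gridPtr[i][j]);
        }
    }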