Tags: c++, multithreading, openmp, false-sharing

OpenMP poor performance with arrays


I have the following issue: I am trying to parallelize a very simple PDE solver in C++ with OpenMP, but the performance does not improve as I increase the number of threads. The equation is a simple 1D heat equation with convection. Since I need the solution at every time step, I have decided to work with a 2D array

double solution[iterationsTime][numPoints];

in which each row contains the discretized function at a specific time step. The update is done via a for loop:

#pragma omp parallel default(shared) private(t, i, iBefore, iAfter)
{
  for(t=0; t<iterationsTime-1; t++) {

    #pragma omp for schedule(auto)
    for(i=0; i<numPoints; i++) {
      iBefore = (i==0)?numPoints-2:i-1;
      iAfter = (i==numPoints-1)?1:i+1;
      solution[t+1][i] = solution[t][iAfter] - solution[t][iBefore];
    }
  }
}

The values iBefore and iAfter are used because I am essentially treating each row as a ring buffer: the PDE has periodic boundary conditions, so the domain is treated as a ring. Anyway, each update of solution[t+1] requires some computations on solution[t], like the one shown in the code above. I understand that the most likely cause of the poor scalability is false sharing, so I have transformed the 2D matrix into a 3D matrix

double solution[iterationsTime][numPoints][PAD];

This allows me to ensure that no write operation is performed on a shared cache line, as I can vary the size of PAD. The code changes a bit, since now every value is stored in

solution[t][i][0];

And the following one in

solution[t][i+1][0];

Note that the required memory is allocated on the heap outside the parallel region. The code produces correct results, but it does not scale. I have tried different schedules such as static, dynamic, auto, ... I compile it with

g++ code.cpp -fopenmp -march=native -O3 -o out

I have tried removing or adding the -march and -O3 flags, but I see no improvement. I have also tried different sizes of PAD and environment variables such as OMP_PROC_BIND, but still no improvement. At this point I have no idea what is causing the loss in performance. This is the code:

#include <iostream>
#include <vector>
#include <cmath>
#include <cstdlib>
#include <omp.h>

const int NX = 500; //DOMAIN DISCRETIZATION
const int PAD = 8; //PADDING TO AVOID FALSE SHARING
const double DX = 1.0/(NX-1.0); //STEP IN SPACE
const double DT = 0.01*DX; //STEP IN TIME
const int NT = 1000; //MAX TIME ITERATIONS
const double C = 10.0; //CONVECTION VELOCITY
const double K = 0.01; //DIFFUSION COEFFICIENT

int main(int argc, char **argv) {
  omp_set_num_threads(std::atoi(argv[1]));  //SET THE REQUIRED NUMBER OF THREADS

  //INITIALIZING MEMORY --> USING STD::VECTOR INSTEAD OF DOUBLE***
  std::vector<std::vector<std::vector<double>>> solution(NT, std::vector<std::vector<double>>(NX, std::vector<double>(PAD,0)));
  for (int i=0; i<NX; i++){
    solution[0][i][0] = std::sin(i*DX*2*M_PI); //INITIAL CONDITION
  }

  int numThreads, i, t, iBefore, iAfter;
  double energy[NT]{0.0}; //ENERGY of the solution --> e(t)= integral from 0 to 1 of ||u(x,t)||^2 dx

  //SOLVE THE PDE ON A RING
  double start = omp_get_wtime();
  #pragma omp parallel default(none) shared(solution, energy, numThreads, std::cout) private(i, t, iBefore, iAfter)
  {
    #pragma omp master
    numThreads = omp_get_num_threads();

    for(t=0; t<NT-1; t++){

      #pragma omp for schedule(static, 8) nowait
      for(i=0; i<NX; i++){
        iBefore = (i==0)?NX-2:i-1;
        iAfter = (i==NX-1)?1:i+1;
        solution[t+1][i][0]=solution[t][i][0] 
          + DT*(  -C*((solution[t][iAfter][0]-solution[t][iBefore][0])/(2*DX))
            + K*(solution[t][iAfter][0]-2*solution[t][i][0]+ solution[t][iBefore][0])/(DX*DX)  );
      }

      // COMPUTE THE ENERGY OF PREVIOUS TIME ITERATION
      #pragma omp for schedule(auto) reduction(+:energy[t])
      for(i=0; i<NX; i++) {
        energy[t] += DX*solution[t][i][0]*solution[t][i][0];
      }
    }
  }
  std::cout << "numThreads: " <<numThreads << ". Elapsed Time: "<<(omp_get_wtime()-start)*1000 << std::endl;
  return 0;
}

And these are the timings:

numThreads: 1. Elapsed Time: 9.65456
numThreads: 2. Elapsed Time: 9.1855
numThreads: 3. Elapsed Time: 9.85965
numThreads: 4. Elapsed Time: 8.9077
numThreads: 5. Elapsed Time: 15.5986
numThreads: 6. Elapsed Time: 15.5627
numThreads: 7. Elapsed Time: 16.204
numThreads: 8. Elapsed Time: 17.5612

Solution

  • Analysis

    First of all, you are working at too small a granularity for multi-threading to be very efficient. Indeed, your sequential time is 9.6 ms and there are 999 time-steps. As a result, each time-step takes roughly 9.6 µs, which is rather small.

    Additionally, memory accesses are not efficient:

    • On one hand, using std::vector<std::vector<std::vector<double>>> internally produces an array of pointers to arrays of pointers to double-precision arrays (all dynamically allocated). These arrays will probably not be contiguous in memory and could also be badly aligned. This can significantly slow down the execution, as it is harder for the processor to prefetch data from memory. Consider allocating one big array rather than many small ones (e.g. one big flattened std::vector).
    • On the other hand, padding the way you do results in a very inefficient memory access pattern. Indeed, you only use 1 double out of 8, so 7/8 of the memory is wasted (probably more, since std::vector can allocate additional memory). Additionally, the values actually read/written are not contiguous due to the added padding, which makes it hard for the processor to prefetch data and to use the memory bandwidth efficiently (reads/writes are performed per cache line, i.e. multiple double-precision scalars at a time). Consider padding rows or chunks of your matrix rather than individual scalars (see the sketch after this list).
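
    As a rough illustration of both points, here is a minimal sketch (not from the original code; the 64-byte cache-line size is an assumption) of one contiguous buffer whose row stride is rounded up to a whole number of cache lines. Each time-step stays contiguous and the padding costs at most a few doubles per row instead of 7 doubles out of every 8:

    #include <vector>

    int main() {
      constexpr int NX = 500;   // grid points per time-step (as in the question)
      constexpr int NT = 1000;  // number of time-steps (as in the question)
      constexpr int LINE = 64;  // assumed cache-line size in bytes
      // Row stride in doubles, rounded up to a whole number of cache lines (504 here).
      constexpr int STRIDE = (NX * sizeof(double) + LINE - 1) / LINE * (LINE / sizeof(double));

      // One contiguous allocation; element (t, i) lives at solution[t*STRIDE + i].
      std::vector<double> solution(NT * STRIDE, 0.0);
      solution[1*STRIDE + 0] = solution[0*STRIDE + 1]; // example of the (t, i) indexing
      return 0;
    }

    Chunk-level padding (aligning each thread's block of columns to a cache-line boundary) follows the same idea, should false sharing inside a row turn out to matter.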

    Finally, a schedule with blocks of size 8 seems too small. Just specifying schedule(static) should probably be better here, for both the parallel for and the reduction (the schedule should be the same, and static, for both loops if you use nowait and want correct results, I think).

    Consequently, you are probably measuring latency and memory overheads.

    Improved version

    Here is the corrected code with the most important fixes (false-sharing effects are ignored):

    #include <iostream>
    #include <vector>
    #include <cmath>
    #include <cstdlib>
    #include <omp.h>
    
    const int NX = 500; //DOMAIN DISCRETIZATION
    const int PAD = 8; //PADDING TO AVOID FALSE SHARING
    const double DX = 1.0/(NX-1.0); //STEP IN SPACE
    const double DT = 0.01*DX; //STEP IN TIME
    const int NT = 1000; //MAX TIME ITERATIONS
    const double C = 10.0; //CONVECTION VELOCITY
    const double K = 0.01; //DIFFUSION COEFFICIENT
    
    int main(int argc, char **argv) {
      omp_set_num_threads(std::atoi(argv[1]));  //SET THE REQUIRED NUMBER OF THREADS
    
      //INITIALIZING MEMORY --> USING A FLATTENED DOUBLE ARRAY
      std::vector<double> solution(NT * NX);
      for (int i=0; i<NX; i++){
        solution[0*NX+i] = std::sin(i*DX*2*M_PI); //INITIAL CONDITION
      }
    
      int numThreads, i, t, iBefore, iAfter;
      double energy[NT]{0.0}; //ENERGY of the solution --> e(t)= integral from 0 to 1 of ||u(x,t)||^2 dx
    
      //SOLVE THE PDE ON A RING
      double start = omp_get_wtime();
      #pragma omp parallel default(none) shared(solution, energy, numThreads, std::cout) private(i, t, iBefore, iAfter)
      {
        #pragma omp master
        numThreads = omp_get_num_threads();
    
        for(t=0; t<NT-1; t++){
    
          #pragma omp for schedule(static) nowait
          for(i=0; i<NX; i++){
            iBefore = (i==0)?NX-2:i-1;
            iAfter = (i==NX-1)?1:i+1;
            solution[(t+1)*NX+i]=solution[t*NX+i]
              + DT*(  -C*((solution[t*NX+iAfter]-solution[t*NX+iBefore])/(2*DX))
                + K*(solution[t*NX+iAfter]-2*solution[t*NX+i]+ solution[t*NX+iBefore])/(DX*DX)  );
          }
    
          // COMPUTE THE ENERGY OF PREVIOUS TIME ITERATION
          #pragma omp for schedule(static) reduction(+:energy[t])
          for(i=0; i<NX; i++) {
            energy[t] += DX*solution[t*NX+i]*solution[t*NX+i];
          }
        }
      }
      std::cout << "numThreads: " <<numThreads << ". Elapsed Time: "<<(omp_get_wtime()-start)*1000 << std::endl;
      return 0;
    }
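
    For reference, the improved version is built and run the same way as the original, with the thread count still passed as the first argument, for example:

    g++ code.cpp -fopenmp -march=native -O3 -o out
    ./out 6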
    

    Results

    On my machine with 6 cores (Intel i5-9600KF), I get the following results.

    Before:

    1 thread : 3.35 ms
    2 threads: 2.90 ms
    3 threads: 2.89 ms
    4 threads: 2.83 ms
    5 threads: 3.07 ms
    6 threads: 2.90 ms
    

    After:

    1 thread : 1.62 ms
    2 threads: 1.03 ms
    3 threads: 0.87 ms
    4 threads: 0.95 ms
    5 threads: 1.00 ms
    6 threads: 1.16 ms
    

    With the new version, the sequential time is much faster and the code scales well up to 3 cores. Beyond that, synchronization overheads become significant and slow down the overall execution (note that each time-step lasts less than 1 µs here, which is very small).
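
    If the remaining synchronization cost matters, one further option (just a sketch, not measured here) is to fuse the energy reduction into the update loop, so that each time-step sweeps solution[t] once and enters a single worksharing construct instead of two; the array-element reduction is the same construct used above. The time-stepping loop inside the parallel region would become:

    for(t=0; t<NT-1; t++){
      // Update solution[t+1][*] and accumulate the energy of time-step t in the same
      // loop: one sweep over solution[t] and one worksharing construct per time-step
      // (the implicit barrier at the end of the loop still separates time-steps).
      #pragma omp for schedule(static) reduction(+:energy[t])
      for(i=0; i<NX; i++){
        iBefore = (i==0)?NX-2:i-1;
        iAfter = (i==NX-1)?1:i+1;
        solution[(t+1)*NX+i] = solution[t*NX+i]
          + DT*(  -C*((solution[t*NX+iAfter]-solution[t*NX+iBefore])/(2*DX))
            + K*(solution[t*NX+iAfter]-2*solution[t*NX+i]+ solution[t*NX+iBefore])/(DX*DX)  );
        energy[t] += DX*solution[t*NX+i]*solution[t*NX+i];
      }
    }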