Tags: c, gcc, parallel-processing, openmp

OpenMP parallelization not performing as expected


I have a code that looks like this:

void SimulationStep(float *In, float *Out, float L, int N)
{
    Out[0] = In[0] - 2.0f*L*In[0] + L*In[N-1] + L*In[1];
    for (int x = 1; x < N-1; x++)
    {
        Out[x] = In[x] - 2.0f*L*In[x] + L*In[x-1] + L*In[x+1];
    }
    Out[N-1] = In[N-1] - 2.0f*L*In[N-1] + L*In[N-2] + L*In[0];
}

I am trying to parallelize it. I have tried many variations; this is one example:

void SimulationStep(float *In, float *Out, float L, int N)
{
    Out[0] = In[0] - 2.0f*L*In[0] + L*In[N-1] + L*In[1];
    #pragma omp parallel for
    for (int x = 1; x < N-1; x++)
    {
        Out[x] = In[x] - 2.0f*L*In[x] + L*In[x-1] + L*In[x+1];
    }
    Out[N-1] = In[N-1] - 2.0f*L*In[N-1] + L*In[N-2] + L*In[0];
}

The change I've applied only gains a boost of 0.5 seconds, from 14 s to 13.5 s, so I suspect that the code is not actually being parallelized. I think it might be a memory-bound bottleneck, so I don't know what to do. Thank you in advance. PS: I'm compiling with gcc 9.2.0 and -O3 -fopenmp.


Solution

  • @Black_Alistar is alluding to Amdahl's Law: the achievable speed-up is limited by the proportion of the program that can be parallelized. It looks like your application is spending most of its time outside the simulation loop.

    When I saw the marginal improvement in execution time, I immediately suspected your application is I/O bound:

    1. If you store intermediate steps of your simulation, your application is likely to be I/O bound and not CPU bound, so parallelizing the computation will make almost no difference;
    2. Reading or writing may be slow because you are handling the data one item at a time;
    3. Do not read or write in small blocks when using unbuffered I/O.