
Why doesn't Amdahl's Law, applied to the serial and parallel fractions, give a theoretical speedup of 4 on a quad-core CPU?


I have some code
(the Floyd-Warshall algorithm for shortest paths in an NxN matrix)
with three nested for loops, each running the same number of iterations.

In the innermost loop there is an assignment via a ternary operation `= <bool> ? <val1> : <val2>`, based on whether a comparison is true or not.

I used OpenMP to parallelize the second for loop with `#pragma omp parallel for`.

I can't work out the parallel and serial fractions of the code, so I can't apply Amdahl's Law to recover the theoretical speedup.

    for (k = 0; k < N; k++)
        #pragma omp parallel for private(i, j) shared(x) num_threads(4)
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                x[i][j] = x[i][j] < (x[i][k] + x[k][j]) ? x[i][j] : (x[i][k] + x[k][j]);
            }
        }

I'm using four cores, so I expect a theoretical speedup of 4.

x[i][j] is the matrix element holding the weight of the edge connecting nodes i and j; it is the macro INF (infinite) if they are not connected.


Solution

  • The inner j loop and the body of the innermost loop are executed serially on each core. That's because you marked only the i loop to be executed in parallel.

    But:

    • I would expect far less than a speedup of 4. There is always communication and synchronization overhead.
    • The body of the innermost loop reads and modifies the same matrix on all cores, so changes to the matrix must be propagated to all other cores. This can lead to the following problems:

      1. CPU caches might become useless, because cached array elements can be changed by another core, which may hold a different copy in its own cache.
      2. In the worst case, every modification of the matrix depends on the previous one (I haven't checked whether this holds for your case). Then no parallel execution is possible => no speedup at all, no matter how many cores you use.

    You should check whether it is possible to change your algorithm so that non-intersecting submatrices can be processed independently. You will get the best speedup when the cores work on separate, non-intersecting data.

    Since there is nearly no effort in doing so, you should definitely profile the code and its variants.