Tags: performance, parallel-processing, openmp, compiler-optimization, gcc

Efficiency of OpenMP vs optimisation levels


I am new to OpenMP, but I have been puzzled by this for a few days and couldn't find an answer online. Hopefully someone here can explain this strange phenomenon to me.

I wanted to compare runtimes between a sequential and a parallel version of the same program. The parallel version runs much faster than the sequential one (~5x) when I compile them (on gcc-10) with -O or above (the differences between the levels themselves are quite small).

However, this is not the case when I compile both programs with -O0: there, the sequential version is even slightly faster. I tried to work out whether some of the optimisations enabled only at -O1 and above were having a substantial effect, but without luck.

For the record, compiling with -Os is better than -O0 but far less efficient than -O1 and above.

Did anyone notice something similar? Is there an explanation for this?

Thanks!

====

Here are links to the C files: sequential code, parallel code


Solution

  • The core of all your loops is something like:

    var += something;
    

    In the sequential code, each var is a local stack variable and with -O0 the line compiles to:

    ; Compute something and place it in RAX
    ADD QWORD PTR [RBP-vvv], RAX
    

    Here vvv is the offset of var in the stack frame rooted at the address stored in RBP.

    With OpenMP, the body of the parallel region is outlined into a separate function and the same expression becomes:

    *(omp_data->var) = *(omp_data->var) + something;
    

    where omp_data is a pointer to a structure holding pointers to the shared variables used in the parallel region. This compiles to:

    ; Compute something and store it in RAX
    MOV RDX, QWORD PTR [RBP-ooo]  ; Fetch omp_data pointer
    MOV RDX, QWORD PTR [RDX]      ; Fetch *(omp_data->var)
    ADD RDX, RAX
    MOV RAX, QWORD PTR [RBP-ooo]  ; Fetch omp_data pointer
    MOV QWORD PTR [RAX], RDX      ; Assign to *(omp_data->var)
    

    This is the first reason the parallel code is slower: the simple act of incrementing var involves more memory accesses.
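
    To make the transformation concrete, here is a rough sketch of the outlined code (the struct and function names are made up for illustration; GCC's actual generated names and layout differ):

    struct omp_data_s {
        long *var;                    /* pointer to the shared accumulator in the parent frame */
        /* ... one pointer per shared variable ... */
    };

    /* Every thread executes the outlined body with a pointer to the shared-data struct. */
    static void parallel_region_fn(struct omp_data_s *omp_data, long something)
    {
        /* var += something; now needs two dependent loads and a store: */
        *(omp_data->var) = *(omp_data->var) + something;
    }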

    The second, and actually stronger, reason is false sharing. You have 8 shared accumulators: xa, xb, etc. Each is 8 bytes long and aligned in memory, for a total of 64 bytes. Given how most compilers place such variables in memory, they most likely end up next to each other, in one cache line or split across two (a cache line on x86-64 is 64 bytes long and is read and written as a single unit). When one thread writes to its accumulator, e.g. thread 0 updates xa, it invalidates that cache line in every other core whose accumulator shares it, and those threads have to re-read the value from a higher cache level or even from main memory. This is bad. It is so bad that the slowdown it causes is far worse than having to access the accumulators through double pointer indirection.
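
    To see the effect in isolation, here is a minimal toy program (not your code; the array size and iteration count are arbitrary) in which eight threads each update their own element of a shared array, all eight elements most likely sharing one cache line:

    #include <stdio.h>
    #include <omp.h>

    #define N 100000000L

    int main(void)
    {
        double acc[8] = {0};              /* 8 x 8 bytes = 64 bytes, likely one cache line */

        #pragma omp parallel num_threads(8)
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < N; i++)
                acc[t] += 1.0;            /* every write invalidates the line in the other cores */
        }

        printf("%f\n", acc[0]);
        return 0;
    }

    No two threads ever touch the same element, yet cache coherence works at cache-line granularity, so each write still forces the other cores to reload the line and the loop scales poorly.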

    What does -O1 change? It introduces register optimisation:

    register long r = *(omp_data->var);
    for (a = ...) {
       r += something;
    }
    *(omp_data->var) = r;
    

    Despite var being a shared variable, OpenMP's relaxed memory model allows each thread's view of memory to diverge temporarily between synchronisation points. That is what makes the register optimisation legal: var is kept in a register for the whole loop and written back to memory only once at the end, so both the double indirection and the false sharing disappear from the hot loop.

    The solution is to simply make all xa, xb, etc. private.
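
    Here is the toy program from above with the accumulation done in a private variable, which is the moral equivalent of making xa, xb, etc. private in your code: each thread works on its own stack variable and touches shared memory only once.

    #include <stdio.h>
    #include <omp.h>

    #define N 100000000L

    int main(void)
    {
        double acc[8] = {0};

        #pragma omp parallel num_threads(8)
        {
            int t = omp_get_thread_num();
            double local = 0.0;           /* private: lives in the thread's stack or a register */
            for (long i = 0; i < N; i++)
                local += 1.0;
            acc[t] = local;               /* a single shared write per thread */
        }

        printf("%f\n", acc[0]);
        return 0;
    }

    With the hot loop operating on a private variable there is neither double indirection nor false sharing, whatever the optimisation level. If several threads contribute to the same accumulator in your code, a reduction(+:...) clause is the idiomatic way to combine the private copies at the end.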