Tags: c++, linux, boost, openmp, opensuse

Low performance of multicore calculations on Linux (OpenMP, boost::thread, etc.)


I want to use multicore calculations in my applications, so I began developing a sample application with OpenMP (C++).

When I ran it I found that my multicore calculations are no faster than the serial ones (in some cases the multicore calculations are even slower):

./openmp_test 

Serial. Sum: 1.77544e+08 time: 21.84

Reduction, 2 threads. Sum: 1.77544e+08 time: 21.65

Two sections. Sum: 1.77544e+08 time: 60.65

My next idea was to create a boost::thread application to test two threads on the CPU cores. Results:

./boost_thread_test 

Serial. Sum: 1.42146e+09 time: 179.64

Two boost threads. Sum: 1.42146e+09 time: 493.34

I use an openSUSE (x64) laptop with a Core i3 CPU.

Why is my multithreading performance so poor?


Solution

  • Both your codes, the one based on OpenMP sections and the one based on boost::thread, are likely victims of false sharing. False sharing occurs because temporal loads and stores operate on whole cache lines rather than directly on their operands. For example, the following statement:

    sum = sum + value;
    

    results in not only the value of sum being read from memory, updated and then written back, but rather a whole small section of memory, known as a cache line, being read and then written back. A cache line on modern x86 CPUs is usually 64 bytes, which means that not only the value of sum would be loaded from/stored to memory, but also the 56 bytes around it. Cache lines also always start at addresses that are multiples of 64. What are the implications for your code?
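
    As a side note, you can check the cache line size on your Linux system at run time. A minimal sketch, assuming glibc (whose <unistd.h> provides the non-standard _SC_LEVEL1_DCACHE_LINESIZE name; getconf LEVEL1_DCACHE_LINESIZE prints the same value from the shell):

    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        // glibc extension: L1 data cache line size in bytes
        // (may return 0 or -1 if the information is unavailable)
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        std::printf("L1 data cache line size: %ld bytes\n", line);
        return 0;
    }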

    In the OpenMP sections code you have:

    double sum1;
    double sum2;
    
    ...
    // one section operates on sum1
    ...
    // one section operates on sum2
    ...
    

    sum1 and sum2 are located on the stack of the parent function omp_sections (a side note - the omp_ prefix is reserved for functions in the OpenMP runtime library; do not use it to name your own functions!). Being doubles, sum1 and sum2 are aligned on an 8-byte boundary and take 16 bytes in total. The probability that they both fall within the same cache line is 7/8, or 87.5%. What happens when the first thread wants to update sum1 is the following:

    • It reads the cache line that holds sum1
    • It updates the value of sum1
    • It notifies all other cores that the content of the cache line has changed so that they must invalidate it in their caches

    The last part is crucial - it is part of what is known as cache coherence. Since sum1 and sum2 are likely to fall in the same cache line, the core that executes the second thread has to invalidate its copy of the cache line and reload it from a lower level of the memory hierarchy (e.g. from the shared last-level cache or from main memory). Exactly the same happens when the second thread modifies the value of sum2.
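
    If you want to see whether the two accumulators really end up in the same cache line in your build, you can compare their addresses. A minimal sketch, assuming a 64-byte cache line (the variable names mirror the ones above):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        double sum1 = 0, sum2 = 0;
        // two addresses share a cache line if they fall into the same 64-byte block
        bool shared = (reinterpret_cast<std::uintptr_t>(&sum1) / 64) ==
                      (reinterpret_cast<std::uintptr_t>(&sum2) / 64);
        std::printf("sum1 and sum2 %s a cache line\n", shared ? "share" : "do not share");
        return 0;
    }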

    One possible solution is to use the reduction clause as in the case where you use the OpenMP worksharing directive for:

    double sum = 0;
    
    #pragma omp parallel sections reduction(+:sum) num_threads(2)
    {
       ...
    }
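
    For reference, here is a sketch of the worksharing variant mentioned above, with the reduction applied to a parallel for loop. The triple sin/cos loop and the names a and niter are taken from your code; parallel_sum is just an arbitrary wrapper name:

    #include <cmath>

    double parallel_sum(const double *a, int niter)
    {
        double sum = 0;

        // each thread accumulates into its own private copy of sum;
        // the copies are combined at the end of the loop
        #pragma omp parallel for reduction(+:sum) num_threads(2)
        for (int i = 0; i < niter; i++)
            for (int j = 0; j < niter; j++)
                for (int k = 0; k < niter; k++)
                    sum += std::sin(a[i]) * std::cos(a[j]) * std::sin(a[k]);

        return sum;
    }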
    

    Another possible solution is to insert some padding between the two values as to make them more than one cache line apart:

    double sum1;
    char pad[64];
    double sum2;
    

    I don't know if the C++ standard gives any guarantee about how local variables are laid out on the stack, i.e. there might not be a guarantee that the compiler would not "optimise" the placement of the variables and reorder them as sum1, sum2, pad. If so, the variables can be placed in a structure, since data members with the same access control are guaranteed to be laid out in declaration order (see the sketch below).
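
    A minimal sketch of that approach (PaddedSums is an arbitrary name):

    struct PaddedSums
    {
        double sum1;
        char   pad[64];   // keeps sum1 and sum2 at least one cache line apart
        double sum2;
    };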

    The problem is essentially the same in your threaded case. The class data members take:

    double *a;   // 4 bytes on x86, 8 bytes on x64
    int niter;   // 4 bytes
    int start;   // 4 bytes
    int end;     // 4 bytes
    // 4 bytes padding on x64 because doubles must be aligned
    double sum;  // 8 bytes
    

    The class data members take 24 bytes on x86 and 32 bytes on x64 (x86 in 64-bit mode). This means that two class instances can fit in the same cache line or are likely to share one. Again, you might add a padding data member after sum of at least 32 bytes in size:

    class Calc
    {
    private:
        double *a;
        int niter;
        int start;
        int end;
        double sum;
        char pad[32];
    ...
    };
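
    If your compiler supports C++11, an alternative to the manual padding is to over-align the whole class; a sketch under that assumption:

    class alignas(64) Calc   // each instance starts on a 64-byte boundary and
    {                        // sizeof(Calc) is padded to a multiple of 64
    private:
        double *a;
        int niter;
        int start;
        int end;
        double sum;
        // ... member functions as in your code ...
    };

    This guarantees that two Calc instances never share a cache line. Keep in mind that before C++17 dynamically allocated objects of over-aligned types are not guaranteed to honour this alignment, so it is most reliable for automatic and static instances.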
    

    Note that private variables, including the implicit private copies created by the reduction clause, likely reside on the individual threads' stacks and are therefore much more than one cache line apart, so no false sharing occurs and the code runs faster in parallel.

    Edit: I forgot to mention that most compilers remove unused variables during the optimisation phase, so in the case of the OpenMP sections code the padding would most likely be optimised out. This can be cured by applying alignment attributes instead (warning: possibly GCC specific):

    double sum1 __attribute__((aligned(64))) = 0;
    double sum2 __attribute__((aligned(64))) = 0;
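
    C++11's standard alignas specifier (used above to over-align the class) achieves the same effect here and is portable across compilers:

    alignas(64) double sum1 = 0;   // each variable starts on its own cache line
    alignas(64) double sum2 = 0;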
    

    Although this removes the false sharing, it still prevents most compilers from using register optimisations because sum1 and sum2 are shared variables. Hence it will still be slower than the version that uses reduction. On my test system, aligning both variables on a cache line boundary reduces the execution time from 56 seconds to 30 seconds, compared to a serial execution time of 20 seconds. This only goes to show that OpenMP constructs sometimes spoil compiler optimisations and the parallel code might run much slower than the serial one, so one has to be careful.

    You can make both variables lastprivate and this would allow the compiler to perform register optimisations on them:

    #pragma omp parallel sections num_threads(2) lastprivate(sum1,sum2)
    

    With this modification the sections code runs just as fast as the code with the worksharing directive. Another possible solution is to accumulate into local variables and assign to sum1 and sum2 after the loops have finished:

    #pragma omp section
    {
        double s = 0;
        for (int i = 0; i < niter / 2; i++)
        {
            for (int j = 0; j < niter; j++)
            {
                for (int k = 0; k < niter; k++)
                {
                    double x = sin(a[i]) * cos(a[j]) * sin(a[k]);
                    s += x;
                }
            }
        }
        sum1 = s;
    }
    // Same for the other section
    

    This one is essentially equivalent to threadprivate(sum1).

    Unfortunately I don't have Boost installed, so I cannot test your threaded code. Try executing the whole calculation inside Calc::run() to see what implications using a C++ class has for the speed.
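
    For what it is worth, the local-accumulation trick from the sections example applies to the threaded version as well. A sketch of what Calc::run() could look like, under the assumption that the members listed above (a, niter, start, end, sum) describe your class and that <cmath> is included:

    void Calc::run()
    {
        double s = 0;   // local accumulator, lives on this thread's stack

        for (int i = start; i < end; i++)
            for (int j = 0; j < niter; j++)
                for (int k = 0; k < niter; k++)
                    s += std::sin(a[i]) * std::cos(a[j]) * std::sin(a[k]);

        sum = s;        // the shared data member is written only once
    }

    This way the hot loop never touches the cache line that holds the sum members of the two Calc instances.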