I am new to parallelization, and I hope I'm not wasting anyone's time. I already asked a few friends who have used OpenMP, but they could not help me, so I figured my case could be interesting for someone else too, at least for educational purposes, and I have tried to document it as well as I could. Below are two examples: the first taken verbatim from Tim Mattson's OpenMP tutorials on YouTube, the second somewhat simplified but, I think, still a fairly standard approach. In both cases the computation time scales with the number of threads for a small number of iterations, but for a very large number of iterations it seems to converge to the same value regardless of the thread count. This is of course the opposite of what I expected: for few iterations the times should be similar (overhead dominating), while for a large number of iterations the parallel version should be clearly faster.
Here are the two examples, both compiled with
g++ -fopenmp main.cpp -o out
using gcc 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04, thread model: posix) on Ubuntu 14.04, with the following common header:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#include <iostream>
using namespace std;
#define NUMBER_OF_THREADS 2           // threads requested in both examples
static long num_steps = 1000000000;   // 1e9 steps of the midpoint rule for pi
Now, the machine I'm working on right now reports 8 cores (an Intel i7), so I would have expected any number of threads between 2 and 4 to bring a big advantage in terms of computation time.
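(As a quick sanity check, the OpenMP runtime can report this directly. Here is a tiny standalone program needing only omp.h and stdio.h; note that omp_get_num_procs() counts logical processors, and an i7 with hyper-threading typically reports twice its number of physical cores, which limits the speed-up one can realistically expect:)

// Standalone sanity check of what the OpenMP runtime sees on this machine.
#include <omp.h>
#include <stdio.h>

int main() {
    printf("processors available: %d\n", omp_get_num_procs());
    printf("default max threads:  %d\n", omp_get_max_threads());
    return 0;
}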
Example 1:
int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    double step = 1.0/(double) num_steps, pi = 0.0;
    auto begin = chrono::high_resolution_clock::now();
    #pragma omp parallel
    {
        int i, ID, nthrds;
        double x, sum = 0;                          // per-thread partial sum
        ID = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        for (i = ID; i < num_steps; i = i + nthrds) {   // cyclic distribution of iterations
            x = (i + 0.5)*step;
            sum = sum + 4.0/(1.0 + x*x);
        }
        #pragma omp critical                        // runs once per thread, so contention is negligible
        pi += step*sum;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << chrono::duration_cast<chrono::nanoseconds>(end - begin).count()/1e6 << "ms\n";
    return 0;
}
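(Side note: neither example prints the result itself, so as a correctness check it can be worth adding, after the timing output,

cout << "pi = " << pi << "\n";   // should print roughly 3.14159265

which also makes sure the compiler cannot drop the whole loop as dead code once optimizations are turned on.)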
Example 2:
int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    double pi = 0, sum = 0;
    const double step = 1.0/(double) num_steps;
    auto begin = chrono::high_resolution_clock::now();
    #pragma omp parallel for reduction(+:sum)   // each thread gets a private sum; partials are combined at the end
    for (int i = 0; i < num_steps; i++) {
        double x = (i + 0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }
    pi += step*sum;
    auto end = chrono::high_resolution_clock::now();
    cout << chrono::duration_cast<chrono::nanoseconds>(end - begin).count()/1e6 << "ms\n";
    return 0;
}
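As a side note, OpenMP also ships its own wall-clock timer, omp_get_wtime(), which returns elapsed real time in seconds and is therefore safe for timing parallel regions. A minimal variant of example 2 timed with it (reusing the same header and constants as above) could look like this:

int main() {
    omp_set_num_threads(NUMBER_OF_THREADS);
    double pi = 0, sum = 0;
    const double step = 1.0/(double) num_steps;
    double begin = omp_get_wtime();             // wall-clock seconds
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < num_steps; i++) {
        double x = (i + 0.5)*step;
        sum += 4.0/(1.0 + x*x);
    }
    pi += step*sum;
    double end = omp_get_wtime();
    cout << (end - begin)*1e3 << "ms, pi = " << pi << "\n";
    return 0;
}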
Now, at first I thought example 2 was being slowed down by the reduction of the variable, which disturbs the parallelization, but in example 1 almost nothing is shared. Let me know if I'm doing something really dumb, or if I should specify more aspects of the problem. Thanks to all.
As pointed out by Gilles in the comments, the problem was that I was measuring time with clock(), which adds up the CPU ticks of all the cores. With
chrono::high_resolution_clock::now();
I get the expected speed-up.
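For anyone who lands here with the same symptom, here is a minimal self-contained sketch of the difference (the iteration count and workload are arbitrary): clock() reports CPU time summed over all threads, so for an N-thread run it stays roughly constant while the wall-clock time drops by about a factor of N.

#include <omp.h>
#include <time.h>
#include <chrono>
#include <iostream>
using namespace std;

int main() {
    double sum = 0;
    clock_t c0 = clock();
    auto w0 = chrono::high_resolution_clock::now();

    #pragma omp parallel for reduction(+:sum)   // uses all available threads by default
    for (long i = 0; i < 200000000; i++)
        sum += 1e-9;

    clock_t c1 = clock();
    auto w1 = chrono::high_resolution_clock::now();

    // clock() adds up the CPU time of every thread, so it barely changes
    // with the thread count; the chrono value is the real elapsed time.
    cout << "clock(): " << 1e3*(c1 - c0)/CLOCKS_PER_SEC << "ms of CPU time\n";
    cout << "chrono:  " << chrono::duration_cast<chrono::nanoseconds>(w1 - w0).count()/1e6 << "ms of wall time\n";
    cout << "sum = " << sum << "\n";            // use the result so the loop is not optimized away
    return 0;
}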
For me the question is settled, but maybe we can leave this up as an example for future noobs like me to be referred to. If some mod believes otherwise, the post can be deleted. Thanks again for the help.