c++optimization parallel-processing openmp ppl

Find max element in array OpenMP and PPL versions run much slower than serial code

I'm trying to implement two versions of a function that would find the max element in the array of floats. However, my parallel functions appeared to run much slower than the serial code.

With array of 4194304 (2048 * 2048) floats, I get the following numbers (in microseconds):

serial code: 9433
PPL code: 24184 (more than two times slower)
OpenMP code: 862093 (almost 100 times slower)

Here's the code:

PPL:

float find_largest_element_in_matrix_PPL(float* m, size_t dims)
{
    float max_element;
    int row, col;
    concurrency::combinable<float> locals([] { return (float)INT_MIN; });
    concurrency::parallel_for(size_t(0), dims * dims, [&locals](int curr)
    {
        float &localMax = locals.local();
        localMax = max<float>(localMax, curr);
    });

    max_element = locals.combine([](float left, float right) { return max<float>(left, right); });
    return max_element;
}

OpenMP:

float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
    float max_value = 0.0;
    int i, row, col, index;

    #pragma omp parallel for private(i) shared(max_value, index)
    for (i = 0; i < dims * dims; ++i)
    {
#pragma omp critical
        if (m[i] > max_value)
        {
            max_value = m[i];
            index = i;
        }
    }

    //row = index / dims;
    //col = index % dims;
    return max_value;
}

What's making the code run so slowly? Am I missing something?
Could you help me find out what I'm doing wrong?

Solution

So, as Baum mit Augen noticed, the problem with OpenMP was that I had a critical section and the code didn't actually run in parallel, but synchronously. Removing critical section did the trick.

As for PPL, I've found out that it does a lot more preparations (creating threads and stuff) than OpenMP does, hence the slowdown.

Update

So, here's the correct variant to find max element with OpenMP (the critical section is still needed but inside the if block):

float find_largest_element_in_matrix_OMP(float* m, unsigned const int dims)
{
    float max_value = 0.0;
    int i, row, col, index;

    #pragma omp parallel for
    for (i = 0; i < dims * dims; ++i)
    {
        if (m[i] > max_value)
        {
            #pragma omp critical
            max_value = m[i];
        }
    }
    return max_value;
}

PS: not tested.