Tags: c++, performance, x86, openmp, benchmarking

OpenMP slows down unrelated serial loop


I have two unrelated for loops: one is executed serially, the other with an OpenMP parallel for construct.

The serial loop becomes slower the more OpenMP threads I use.


#include <chrono>
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

#include <omp.h>

class Foo {
public:
    Foo(size_t size) {
        parallel_vector.resize(size, 0.0);
        serial_vector.resize(size, 0.0);
    }

    void do_serial_work() {
        std::mt19937 random_number_generator;
        std::uniform_real_distribution<double> random_number_distribution{ 0.0, 1.0 };

        for (size_t i = 0; i < serial_vector.size(); i++) {
            serial_vector[i] = random_number_distribution(random_number_generator);
        }
    }

    void do_parallel_work() {
#pragma omp parallel for
        // OpenMP 2.0 requires a signed loop index; the cast avoids a signed/unsigned warning.
        for (int i = 0; i < static_cast<int>(parallel_vector.size()); ++i) {
            for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
                parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            }
        }
    }

private:
    std::vector<double> parallel_vector;
    std::vector<double> serial_vector;
};

void test_with_size(size_t size, int num_threads) {
    std::cout << "Testing with " << num_threads << " and size: " << size << "\n";
    omp_set_num_threads(num_threads);

    Foo foo{ size };

    long long total_dur_1 = 0;
    long long total_dur_2 = 0;

    for (auto i = 0; i < 500; i++) {
        const auto tp_1 = std::chrono::high_resolution_clock::now();
        foo.do_serial_work();
        
        const auto tp_2 = std::chrono::high_resolution_clock::now();
        foo.do_parallel_work();

        const auto tp_3 = std::chrono::high_resolution_clock::now();
        const auto dur_1 = std::chrono::duration_cast<std::chrono::microseconds>(tp_2 - tp_1).count();
        const auto dur_2 = std::chrono::duration_cast<std::chrono::microseconds>(tp_3 - tp_2).count();

        total_dur_1 += dur_1;
        total_dur_2 += dur_2;
    }

    std::cout << total_dur_1 << "\t" << total_dur_2 << "\n";
}

int main(int argc, char** argv) {
    test_with_size(100000, 1);
    test_with_size(100000, 2);
    test_with_size(100000, 4);
    test_with_size(100000, 8);

    return 0;
}

The slowdown happens on my local machine, a Win10 laptop with an Intel Core i7-7700 (4 cores with hyperthreading) and 24 GB of RAM. The compiler is the latest in Visual Studio 2019. Compiled as RelWithDebInfo (from CMake; includes /O2 and /openmp).

It does not happen on a stronger machine, a CentOS 8 box with 2x Intel Xeon Platinum 9242 (48 cores each, no hyperthreading) and 769 GB of RAM. The compiler is gcc/8.3.1. Compiled with g++ -std=c++17 -O3 -fopenmp.

Timings on Win10 i7-7700:

Testing with 1 and size: 100000
3043846 10536315
Testing with 2 and size: 100000
3276611 5350204
Testing with 4 and size: 100000
3937311 2735655
Testing with 8 and size: 100000
5002727 1598775

and on CentOS 8, 2x Xeon Platinum 9242:

Testing with 1 and size: 100000
727756  4111363
Testing with 2 and size: 100000
731649  2069257
Testing with 4 and size: 100000
734019  1056157
Testing with 8 and size: 100000
752584  544373

So my initial thought was "there's too much pressure on the cache". However, when I removed virtually everything from the parallel section except the loop itself, the slowdown still occurred.


Updated parallel section with the work taken out:

    void do_parallel_work() {
#pragma omp parallel for
        for (auto i = 0; i < 8; ++i) {
            //for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
            //    parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            //}
        }
    }

Timings on Win10 with updated parallel section:

Testing with 1 and size: 100000
3206293 636
Testing with 2 and size: 100000
3218667 2672
Testing with 4 and size: 100000
3928818 8689
Testing with 8 and size: 100000
5106605 10797

Looking into the OpenMP 2.0 standard (VS only supports 2.0; find it here: https://www.openmp.org/specifications/), it says in 2.7.2.5, lines 7-8:

In the absence of an explicit default clause, the default behavior is the same as if default(shared) were specified.

And in 2.7.2.4 line 30:

All threads within the team access the same storage area for shared variables.

For me, this rules out the possibility that each OpenMP thread copies serial_vector, which was the last explanation I could think of.
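
To make that distinction concrete, here is a minimal sketch (my addition, not from the original post) contrasting the implicit default(shared) behavior with firstprivate, which is what an actual per-thread copy would look like:

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> serial_vector(100000, 0.0);

    // Implicit default: equivalent to default(shared).
    // Every thread sees the very same vector object; no copies are made.
#pragma omp parallel default(shared)
    {
        std::printf("shared address: %p\n", static_cast<void*>(&serial_vector));
    }

    // Only a privatization clause such as firstprivate would copy the vector
    // into each thread (invoking the copy constructor once per thread).
#pragma omp parallel firstprivate(serial_vector)
    {
        std::printf("private copy:   %p\n", static_cast<void*>(&serial_vector));
    }
}

With default(shared), every thread prints the same address; with firstprivate, each thread prints a different one.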

I'm happy for any explanation or discussion on the matter, even if I just plainly missed something.

EDIT:

For curiosity's sake, I also tested on my Win10 machine under WSL, which runs gcc/9.3.0. The timings are:

Testing with 1 and size: 100000
833678  2752
Testing with 2 and size: 100000
762877  1863
Testing with 4 and size: 100000
816440  1860
Testing with 8 and size: 100000
991184  2350

I'm honestly not sure why the Windows executable takes so much longer on the same machine than the Linux one (/O2 is the maximum optimization level for VC++), but funnily enough, the same artifacts don't happen here.


Solution

  • By default, OpenMP on Windows uses 200 ms spinlocks: when execution leaves the omp block, all OpenMP worker threads keep spinning, waiting for new work. This is beneficial if you have many omp blocks right next to each other; in your case, the spinning threads just consume CPU power and compete with the serial loop (see the diagnostic sketch after this list).

    To disable/control the spinlocks you have several options:

    1. Define the environment variable OMP_WAIT_POLICY and set it to PASSIVE to disable the spinlocks completely,
    2. Switch to the Intel OpenMP runtime shipped with oneAPI; then you can fully control the spin time by defining the KMP_BLOCKTIME environment variable,
    3. Install Visual Studio 2019 Preview (this should soon be in the official release) and use the LLVM OpenMP runtime (/openmp:llvm); then you can likewise control the spin time by defining the KMP_BLOCKTIME environment variable.
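
As a quick way to test the spin-wait hypothesis without switching runtimes, here is a hedged sketch (my addition, not part of the answer): if the workers really spin for ~200 ms after a parallel region, sleeping longer than that before the serial work should make the slowdown disappear, because the workers get parked before the serial loop starts. The 250 ms value assumes the default ~200 ms blocktime.

#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<double> v(100000, 0.0);

    // Parallel region; on Windows the worker threads keep spinning after it ends.
#pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i) {
        v[i] += 1.0;
    }

    // Assumed default blocktime of ~200 ms: sleep past it so the workers
    // fall asleep instead of spinning while the serial loop runs.
    std::this_thread::sleep_for(std::chrono::milliseconds(250));

    const auto t0 = std::chrono::high_resolution_clock::now();
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {  // the "unrelated" serial work
        sum += v[i];
    }
    const auto t1 = std::chrono::high_resolution_clock::now();

    std::cout << "serial loop took "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us (sum = " << sum << ")\n";
}

If the serial timing with the sleep matches the single-thread numbers, the spin-wait is the culprit; running the original benchmark with OMP_WAIT_POLICY=PASSIVE set in the environment should have the same effect.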