I have two unrelated for loops: one is executed serially and one with an OpenMP parallel for construct.
The serial code becomes slower the more OpenMP threads I use.
#include <omp.h>

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

class Foo {
public:
    Foo(size_t size) {
        parallel_vector.resize(size, 0.0);
        serial_vector.resize(size, 0.0);
    }

    void do_serial_work() {
        std::mt19937 random_number_generator;
        std::uniform_real_distribution<double> random_number_distribution{ 0.0, 1.0 };
        for (size_t i = 0; i < serial_vector.size(); i++) {
            serial_vector[i] = random_number_distribution(random_number_generator);
        }
    }

    void do_parallel_work() {
        // OpenMP 2.0 (all MSVC supports) requires a signed loop index.
        #pragma omp parallel for
        for (int i = 0; i < static_cast<int>(parallel_vector.size()); ++i) {
            for (int integration_steps = 0; integration_steps < 30; integration_steps++) {
                parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
            }
        }
    }

private:
    std::vector<double> parallel_vector;
    std::vector<double> serial_vector;
};
void test_with_size(size_t size, int num_threads) {
    std::cout << "Testing with " << num_threads << " and size: " << size << "\n";
    omp_set_num_threads(num_threads);

    Foo foo{ size };
    long long total_dur_1 = 0;
    long long total_dur_2 = 0;

    for (auto i = 0; i < 500; i++) {
        const auto tp_1 = std::chrono::high_resolution_clock::now();
        foo.do_serial_work();
        const auto tp_2 = std::chrono::high_resolution_clock::now();
        foo.do_parallel_work();
        const auto tp_3 = std::chrono::high_resolution_clock::now();

        const auto dur_1 = std::chrono::duration_cast<std::chrono::microseconds>(tp_2 - tp_1).count();
        const auto dur_2 = std::chrono::duration_cast<std::chrono::microseconds>(tp_3 - tp_2).count();
        total_dur_1 += dur_1;
        total_dur_2 += dur_2;
    }
    std::cout << total_dur_1 << "\t" << total_dur_2 << "\n";
}
int main(int argc, char** argv) {
    test_with_size(100000, 1);
    test_with_size(100000, 2);
    test_with_size(100000, 4);
    test_with_size(100000, 8);
    return 0;
}
The slowdown happens on my local machine, a Win10 laptop with an Intel Core i7-7700 (4 cores with hyperthreading) and 24 GB of RAM. The compiler is the latest Visual Studio 2019. Compiled as RelWithDebInfo (from CMake, which includes /O2 and /openmp).
It does not happen when I use a stronger machine, a CentOS 8 box with 2x Intel Xeon Platinum 9242 (48 cores each, no hyperthreading) and 769 GB of RAM. The compiler is gcc 8.3.1, compiled with g++ --std=c++17 -O3 -fopenmp.
Timings on Win10 i7-7700:
Testing with 1 and size: 100000
3043846 10536315
Testing with 2 and size: 100000
3276611 5350204
Testing with 4 and size: 100000
3937311 2735655
Testing with 8 and size: 100000
5002727 1598775
and on CentOS 8, 2x Xeon Platinum 9242:
Testing with 1 and size: 100000
727756 4111363
Testing with 2 and size: 100000
731649 2069257
Testing with 4 and size: 100000
734019 1056157
Testing with 8 and size: 100000
752584 544373
So my initial thought was "There's too much pressure on the cache." However, when I removed virtually everything from the parallel section but the loop itself, the slowdown remained.
Updated parallel section with the work taken out:
void do_parallel_work() {
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i) {
        //for (auto integration_steps = 0; integration_steps < 30; integration_steps++) {
        //    parallel_vector[i] += (0.05 - parallel_vector[i]) / 30.0;
        //}
    }
}
Timings on Win10 with updated parallel section:
Testing with 1 and size: 100000
3206293 636
Testing with 2 and size: 100000
3218667 2672
Testing with 4 and size: 100000
3928818 8689
Testing with 8 and size: 100000
5106605 10797
Looking into the OpenMP 2.0 standard (VS only supports 2.0; find it here: https://www.openmp.org/specifications/), it says in 2.7.2.5, lines 7-8:
In the absence of an explicit default clause, the default behavior is the same as if the default(shared) were specified.
And in 2.7.2.4 line 30:
All threads within the team access the same storage area for shared variables.
For me, this rules out that the OpenMP threads each copy serial_vector, which was the last explanation I could think of.
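To sanity-check that, here is a minimal sketch (separate from the benchmark above, not part of my original code) that prints the vector's storage address from every thread; with default(shared) semantics, all threads should report the same pointer:
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> shared_vec(8, 0.0);
    // With the implicit default(shared), every thread sees the same object,
    // so each one should print an identical data() address.
    #pragma omp parallel
    {
        std::printf("thread %d sees data at %p\n",
                    omp_get_thread_num(),
                    static_cast<void*>(shared_vec.data()));
    }
    return 0;
}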
I'm happy for any explanation/discussion on the matter, even if I just plainly missed something.
EDIT:
Out of curiosity's sake, I also tested on my Win10 machine under WSL. It runs gcc 9.3.0, and the timings are:
Testing with 1 and size: 100000
833678 2752
Testing with 2 and size: 100000
762877 1863
Testing with 4 and size: 100000
816440 1860
Testing with 8 and size: 100000
991184 2350
I'm honestly not sure why the Windows executable takes so much longer on the same machine than the Linux one (/O2 is the maximum optimization for VC++), but funnily enough, the same artifacts don't happen here.
OpenMP on Windows by default has a 200 ms spin time: when you leave the omp block, all OpenMP worker threads keep spinning, waiting for new work. This is a benefit if you have many omp blocks right next to each other; in your case, the spinning threads just consume CPU power that the serial loop could use.
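If the spinning workers really are stealing cycles from the serial loop, letting them run past their spin window before the serial work starts should restore the single-threaded timing. A diagnostic sketch, assuming the Foo class from the question (the 300 ms sleep is an assumption chosen to exceed the ~200 ms default, not a measured value):
#include <chrono>
#include <iostream>
#include <thread>

// Helper: time a callable in microseconds.
template <typename F>
long long time_us(F&& f) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    f();
    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

void probe_spinning(Foo& foo) {
    // Serial loop right after the parallel region: workers may still be spinning.
    foo.do_parallel_work();
    const auto hot = time_us([&] { foo.do_serial_work(); });

    // Sleep past the assumed ~200 ms spin window so the workers go idle first.
    foo.do_parallel_work();
    std::this_thread::sleep_for(std::chrono::milliseconds(300));
    const auto cold = time_us([&] { foo.do_serial_work(); });

    std::cout << "hot: " << hot << " us, cold: " << cold << " us\n";
}
If the spin-wait explanation holds, "cold" should be close to the num_threads == 1 serial timing while "hot" shows the slowdown.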
To disable/control the spinning you have several options:
- set the OMP_WAIT_POLICY environment variable to PASSIVE to disable the spinning completely, or
- use the KMP_BLOCKTIME environment variable to control how long the threads keep spinning before going to sleep.
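The wait policy is read when the OpenMP runtime initializes, so the variable has to be in the environment before the first parallel region runs. The reliable way is to set it in the shell before launching (set OMP_WAIT_POLICY=PASSIVE in cmd, export OMP_WAIT_POLICY=passive in bash). As a sketch, it can also be set from code at the very top of main(), under the assumption that the runtime has not been initialized yet and actually honors the variable:
#include <cstdlib>

int main() {
#ifdef _WIN32
    // MSVC CRT call; must happen before the first #pragma omp construct.
    _putenv_s("OMP_WAIT_POLICY", "PASSIVE");
#else
    setenv("OMP_WAIT_POLICY", "passive", 1);  // POSIX
#endif
    // ... run test_with_size(...) from the question here ...
    return 0;
}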