I am trying to parallelise internally a ODE function which is integrated by ODEINT.
I have made the following small example
#include <iostream>
#include <chrono>
#include <Eigen/Dense>
#include <omp.h>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>
#include <boost/numeric/odeint/external/eigen/eigen.hpp>
using namespace boost::numeric::odeint;
class System {
private:
Eigen::VectorXd _input_data;
public:
System( Eigen::VectorXd &input_data ) { _input_data = input_data; };
void operator() ( const Eigen::VectorXd &x , Eigen::VectorXd &dxdt , const double t ) {
double _sum = 0.;
#pragma omp parallel for reduction(+:_sum)
for(int k = 0; k < _input_data.size(); ++k) {
_sum += _input_data(k);
};
dxdt(0) = _sum;
};
};
int main() {
omp_set_num_threads(1);
Eigen::VectorXd input_data = Eigen::VectorXd::Zero(100);
System ode(input_data);
runge_kutta_dopri5<Eigen::VectorXd> rk5_stepper;
Eigen::VectorXd x = Eigen::VectorXd::Zero(1);
auto start = std::chrono::high_resolution_clock::now();
size_t steps = integrate_const(rk5_stepper, ode, x, 0., 1., 0.01);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Execution time: " << duration.count() / 1000000. << " sec" << std::endl;
return 0;
}
with the CMakeLists.txt file
cmake_minimum_required(VERSION 3.13)
set(CMAKE_C_COMPILER /usr/local/bin/gcc-9)
set(CMAKE_CXX_COMPILER /usr/local/bin/g++-9)
project(ODEINT_OPENMP_TEST)
set(CMAKE_CXX_STANDARD 14)
include_directories("/usr/local/include")
find_package(OpenMP REQUIRED)
add_executable(ODEINT_OPENMP_TEST main.cpp)
target_link_libraries(ODEINT_OPENMP_TEST PRIVATE OpenMP::OpenMP_CXX)
when I try to use more threads via omp_set_num_threads(N)
, the program consistently slows down compared to using only a single thread omp_set_num_threads(1)
. The program becomes approximately x3 slower (on my machine) choosing N=2.
Intuitively, the right-hand-side function should run faster in parallel? Am I doing something wrong?
First of all, your loop is too small to benefit from using multiple threads on this specific example (about 3.5 ms on my machine in sequential, 1.8 ms using 6 threads on 6 cores).
Additionally, your benchmark is too short and you probably measure unexpected effects (caching, page faults, processor frequency scaling issues, etc.). Consider putting this in a loop to mitigate most effects (if this make sense in real-world conditions).
Moreover, some OpenMP runtimes create threads when parallel section are executed. This operation is quite slow. Since the directive #pragma omp parallel
is included in the timing, you could also measure thread creation.
Here are the results on my 6-cores machine with a size 1000 times bigger:
1 thread: 2.330 sec
2 threads: 1.212 sec
3 threads: 0.813 sec
4 threads: 0.649 sec
5 threads: 0.532 sec
6 threads: 0.459 sec
The speed-up is 5.1 with 6 threads, which is good.
Please note that you may get a worse scaling since your loop seems memory bound (the memory throughput does not scale with the number of cores used).