Search code examples
c++openmpodeint

OpenMP with ODEINT in ODE function


I am trying to parallelise internally a ODE function which is integrated by ODEINT.

I have made the following small example

#include <iostream>
#include <chrono>
#include <Eigen/Dense>
#include <omp.h>
#include <boost/numeric/odeint.hpp>
#include <boost/numeric/odeint/external/openmp/openmp.hpp>
#include <boost/numeric/odeint/external/eigen/eigen.hpp>

using namespace boost::numeric::odeint;

class System {
private:
    Eigen::VectorXd _input_data;
public:
    System( Eigen::VectorXd &input_data ) { _input_data = input_data; };
    void operator() ( const Eigen::VectorXd &x , Eigen::VectorXd &dxdt , const double t ) {
        double _sum = 0.;
        #pragma omp parallel for reduction(+:_sum)
        for(int k = 0; k < _input_data.size(); ++k) {
            _sum += _input_data(k);
        };
        dxdt(0) = _sum;
    };
};

int main() {
    omp_set_num_threads(1);
    Eigen::VectorXd input_data = Eigen::VectorXd::Zero(100);
    System ode(input_data);
    runge_kutta_dopri5<Eigen::VectorXd> rk5_stepper;
    Eigen::VectorXd x = Eigen::VectorXd::Zero(1);
    auto start = std::chrono::high_resolution_clock::now();
    size_t steps = integrate_const(rk5_stepper, ode, x, 0., 1., 0.01);
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Execution time: " << duration.count() / 1000000. << " sec" << std::endl;
    return 0;
}

with the CMakeLists.txt file

cmake_minimum_required(VERSION 3.13)

set(CMAKE_C_COMPILER /usr/local/bin/gcc-9)
set(CMAKE_CXX_COMPILER /usr/local/bin/g++-9)

project(ODEINT_OPENMP_TEST)

set(CMAKE_CXX_STANDARD 14)

include_directories("/usr/local/include")

find_package(OpenMP REQUIRED)

add_executable(ODEINT_OPENMP_TEST main.cpp)

target_link_libraries(ODEINT_OPENMP_TEST PRIVATE OpenMP::OpenMP_CXX)

when I try to use more threads via omp_set_num_threads(N), the program consistently slows down compared to using only a single thread omp_set_num_threads(1). The program becomes approximately x3 slower (on my machine) choosing N=2.

Intuitively, the right-hand-side function should run faster in parallel? Am I doing something wrong?


Solution

  • First of all, your loop is too small to benefit from using multiple threads on this specific example (about 3.5 ms on my machine in sequential, 1.8 ms using 6 threads on 6 cores).

    Additionally, your benchmark is too short and you probably measure unexpected effects (caching, page faults, processor frequency scaling issues, etc.). Consider putting this in a loop to mitigate most effects (if this make sense in real-world conditions).

    Moreover, some OpenMP runtimes create threads when parallel section are executed. This operation is quite slow. Since the directive #pragma omp parallel is included in the timing, you could also measure thread creation.

    Here are the results on my 6-cores machine with a size 1000 times bigger:

    1 thread:  2.330 sec
    2 threads: 1.212 sec
    3 threads: 0.813 sec
    4 threads: 0.649 sec
    5 threads: 0.532 sec
    6 threads: 0.459 sec
    

    The speed-up is 5.1 with 6 threads, which is good.

    Please note that you may get a worse scaling since your loop seems memory bound (the memory throughput does not scale with the number of cores used).