Tags: c++, multithreading, openmp, performance-testing, latency

Measuring OpenMP Fork/Join latency


Since MPI-3 comes with functionality for shared memory parallelism, and it seems to be perfectly matched for my application, I'm seriously considering rewriting my hybrid OpenMP-MPI code into a pure MPI implementation.

In order to drive the last nail into the coffin, I decided to run a small program to test the latency of the OpenMP fork/join mechanism. Here's the code (written for Intel compiler):

// headers needed for this snippet
#include <cmath>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <vector>

#include <omp.h>

// timer::spot_timer is the author's own timing utility (not shown);
// duration_in_us() returns the elapsed wall-clock time in microseconds.

void action1(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t2.data()[index]) * std::cos(t2.data()[index]);
    }
}

void action2(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * std::sin(t2.data()[index]);
    }
}

void action3(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * t2.data()[index];
    }
}

void action4(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sqrt(t2.data()[index]);
    }
}

void action5(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * 2.0;
    }
}

void all_actions(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t2.data()[index]) * std::cos(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * std::sin(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * t2.data()[index];
        t1.data()[index] = std::sqrt(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * 2.0;
    }
}


int main()
{
    // decide the process parameters
    const auto n = std::size_t{8000000};
    const auto test_count = std::size_t{500};
    
    // garbage data...
    auto t1 = std::vector<double>(n);
    auto t2 = std::vector<double>(n);
    
    //+/////////////////
    // perform actions one after the other
    //+/////////////////
    
    const auto sp = timer::spot_timer();
    const auto dur1 = sp.duration_in_us();
    for (auto index = std::size_t{}; index < test_count; ++index)
    {
        #pragma noinline
        action1(t1, t2);
        #pragma noinline
        action2(t1, t2);
        #pragma noinline
        action3(t1, t2);
        #pragma noinline
        action4(t1, t2);
        #pragma noinline
        action5(t1, t2);
    }
    const auto dur2 = sp.duration_in_us();
    
    //+/////////////////
    // perform all actions at once
    //+/////////////////
    const auto dur3 = sp.duration_in_us();
    for (auto index = std::size_t{}; index < test_count; ++index)
    {
        #pragma noinline
        all_actions(t1, t2);
    }
    const auto dur4 = sp.duration_in_us();
    
    const auto a = dur2 - dur1;
    const auto b = dur4 - dur3;
    if (a < b)
    {
        throw std::logic_error("negative_latency_error");
    }
    const auto fork_join_latency = (a - b) / (test_count * 4);
    
    // report
    std::cout << "Ran the program with " << omp_get_max_threads() << ", the calculated fork/join latency is: " << fork_join_latency << " us" << std::endl;
    
    return 0;
}

As you can see, the idea is to perform a set of actions separately (each inside its own OpenMP parallel loop) and measure the average duration of that, then to perform all of these actions together (inside a single OpenMP parallel loop) and measure the average duration of that as well. This gives a linear system of equations in two unknowns, one of which is the latency of the fork/join mechanism, and it can be solved to obtain that value.
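Spelling out the arithmetic behind the formula in the code above (this just restates the same logic; w_i is the per-iteration work of action i, L is the fork/join latency, and the work is assumed unchanged when the loops are fused):

a = dur2 - dur1 ≈ test_count * (w1 + w2 + w3 + w4 + w5 + 5*L)   (five parallel regions per iteration)
b = dur4 - dur3 ≈ test_count * (w1 + w2 + w3 + w4 + w5 + 1*L)   (one parallel region per iteration)
a - b ≈ test_count * 4 * L   =>   L ≈ (a - b) / (test_count * 4)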

Questions:

  1. Am I overlooking something?
  2. Currently, I am using "-O0" to prevent the smarty-pants compiler from doing its funny business. Which compiler optimizations should I use, and would they also have an effect on the measured latency itself?
  3. On my Coffee Lake processor with 6 cores, I measured a latency of ~850 us. Does this sound about right?

Edit 3

  1. I've included a warm-up calculation at the beginning, upon @paleonix's suggestion,

  2. I've reduced the number of actions for simplicity, and

  3. I've switched to 'omp_get_wtime' to make it universally understandable.

I am now running the following code with flag -O3:

// headers needed for this snippet
#include <cmath>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

#include <omp.h>

void action1(std::vector<double>& t1)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t1.data()[index]);
    }
}

void action2(std::vector<double>& t1)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] =  std::cos(t1.data()[index]);
    }
}

void action3(std::vector<double>& t1)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::atan(t1.data()[index]);
    }
}

void all_actions(std::vector<double>& t1, std::vector<double>& t2, std::vector<double>& t3)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        #pragma optimize("", off)
        t1.data()[index] = std::sin(t1.data()[index]);
        t2.data()[index] = std::cos(t2.data()[index]);
        t3.data()[index] = std::atan(t3.data()[index]);
        #pragma optimize("", on)
    }
}


int main()
{
    // decide the process parameters
    const auto n = std::size_t{1500000}; // 12 MB (way too big for any cache)
    const auto experiment_count = std::size_t{1000};
    
    // garbage data...
    auto t1 = std::vector<double>(n);
    auto t2 = std::vector<double>(n);
    auto t3 = std::vector<double>(n);
    auto t4 = std::vector<double>(n);
    auto t5 = std::vector<double>(n);
    auto t6 = std::vector<double>(n);
    auto t7 = std::vector<double>(n);
    auto t8 = std::vector<double>(n);
    auto t9 = std::vector<double>(n);
    
    //+/////////////////
    // warm-up, initialization of threads etc.
    //+/////////////////
    for (auto index = std::size_t{}; index < experiment_count / 10; ++index)
    {
        all_actions(t1, t2, t3);
    }
    
    //+/////////////////
    // perform actions (part A)
    //+/////////////////
    
    const auto dur1 = omp_get_wtime();
    for (auto index = std::size_t{}; index < experiment_count; ++index)
    {
        action1(t4);
        action2(t5);
        action3(t6);
    }
    const auto dur2 = omp_get_wtime();
    
    //+/////////////////
    // perform all actions at once (part B)
    //+/////////////////

    const auto dur3 = omp_get_wtime();
    #pragma nofusion
    for (auto index = std::size_t{}; index < experiment_count; ++index)
    {
        all_actions(t7, t8, t9);
    }
    const auto dur4 = omp_get_wtime();
    
    const auto a = dur2 - dur1;
    const auto b = dur4 - dur3;
    const auto fork_join_latency = (a - b) / (experiment_count * 2);
    
    // report
    std::cout << "Ran the program with " << omp_get_max_threads() << ", the calculated fork/join latency is: "
        << fork_join_latency * 1E+6 << " us" << std::endl;
    
    return 0;
}

With this, the measured latency is now 115 us. What puzzles me now is that this value changes when the actions are changed. According to my logic, since I'm doing the same actions in both parts A and B, there should actually be no change. Why is this happening?


Solution

  • Here is my attempt at measuring fork-join overhead:

    #include <iostream>
    #include <string>
    
    #include <omp.h>
    
    constexpr int n_warmup = 10'000;
    constexpr int n_measurement = 100'000;
    constexpr int n_spins = 1'000;
    
    void spin() {
        volatile bool flag = false;
        for (int i = 0; i < n_spins; ++i) {
            if (flag) {
                break;
            }
        }
    }
    
    void bench_fork_join(int num_threads) {
        omp_set_num_threads(num_threads);
    
        // create threads, warmup
        for (int i = 0; i < n_warmup; ++i) {
            #pragma omp parallel
            spin();
        }
    
        double const start = omp_get_wtime();
        for (int i = 0; i < n_measurement; ++i) {
            #pragma omp parallel
            spin();
        }
        double const stop = omp_get_wtime();
        double const ptime = (stop - start) * 1e6 / n_measurement;
    
        // warmup
        for (int i = 0; i < n_warmup; ++i) {
            spin();
        }
        double const sstart = omp_get_wtime();
        for (int i = 0; i < n_measurement; ++i) {
            spin();
        }
        double const sstop = omp_get_wtime();
        double const stime = (sstop - sstart) * 1e6 / n_measurement;
    
        std::cout << ptime << " us\t- " << stime << " us\t= " << ptime - stime << " us\n";
    }
    
    int main(int argc, char **argv) {
        auto const params = argc - 1;
        std::cout << "parallel\t- sequential\t= overhead\n";
    
        for (int j = 0; j < params; ++j) {
            auto num_threads = std::stoi(argv[1 + j]);
            std::cout << "---------------- num_threads = " << num_threads << " ----------------\n";
            bench_fork_join(num_threads);
        }
    
        return 0;
    }
    

    You can call it with multiple different numbers of threads, which should not be higher than the number of cores on your machine, to get reasonable results. On my machine with 6 cores, compiling with gcc 11.2, I get

    $ g++ -fopenmp -O3 -DNDEBUG -o bench-omp-fork-join bench-omp-fork-join.cpp
    $ ./bench-omp-fork-join 6 4 2 1
    parallel        - sequential    = overhead
    ---------------- num_threads = 6 ----------------
    1.51439 us      - 0.273195 us   = 1.24119 us
    ---------------- num_threads = 4 ----------------
    1.24683 us      - 0.276122 us   = 0.970708 us
    ---------------- num_threads = 2 ----------------
    1.10637 us      - 0.270865 us   = 0.835501 us
    ---------------- num_threads = 1 ----------------
    0.708679 us     - 0.269508 us   = 0.439171 us
    

    In each line the first number is the average (over 100'000 iterations) with threads and the second number is the average without threads. The last number is the difference between the first two and should be an upper bound on the fork-join overhead.

    Make sure that the numbers in the middle column (no threads) are approximately the same in every row, as they should be independent of the number of threads. If they aren't, make sure there is nothing else running on the computer and/or increase the number of measurements and/or warmup runs.

    In regard to exchanging OpenMP for MPI, keep in mind that MPI is still multiprocessing and not multithreading. You might pay a lot of memory overhead because processes tend to be much bigger than threads.
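    For context, here is a minimal sketch (illustrative only, not code from the question) of the MPI-3 shared-memory route the question mentions: ranks on the same node allocate slices of one shared window with MPI_Win_allocate_shared and can obtain direct pointers to each other's slices with MPI_Win_shared_query, so data is accessed by load/store rather than copied between processes.

    #include <cstddef>

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // communicator containing only the ranks that share this node's memory
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        // every rank contributes a slice of one shared array
        const MPI_Aint local_count = 1'000'000;
        double* local_slice = nullptr;
        MPI_Win win;
        MPI_Win_allocate_shared(local_count * sizeof(double), sizeof(double),
                                MPI_INFO_NULL, node_comm, &local_slice, &win);

        // any rank can get a direct pointer to another rank's slice
        MPI_Aint size_of_rank0;
        int disp_unit_of_rank0;
        double* slice_of_rank0 = nullptr;
        MPI_Win_shared_query(win, 0, &size_of_rank0, &disp_unit_of_rank0, &slice_of_rank0);

        // ... work directly on the shared memory, synchronizing with
        // MPI_Barrier(node_comm) or the RMA synchronization calls ...

        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }

    Note that each rank is still a full process, so this avoids message copies inside a node but not the per-process memory footprint mentioned above.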

    EDIT:

    Revised the benchmark to use spinning on a volatile flag instead of sleeping (thanks @Jérôme Richard). As Jérôme Richard mentioned in his answer, the measured overhead grows with n_spins. Setting n_spins below 1000 didn't significantly change the measurement for me, so that is the value I measured with. As one can see above, the measured overhead is far lower than what the earlier version of the benchmark reported.

    The inaccuracy of sleeping is a problem especially because one always measures the thread that sleeps the longest, and therefore gets a bias toward longer times, even if the sleep times themselves were distributed symmetrically around the requested time.
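    To see roughly how large that bias can be, here is a tiny stand-alone simulation (illustrative only, not part of the benchmark): each simulated "thread" draws a sleep duration that is symmetric around the requested time, but the parallel region only ends once the longest sleeper finishes, so the mean of the per-iteration maximum sits well above the requested time.

    #include <algorithm>
    #include <iostream>
    #include <random>

    int main() {
        constexpr int n_threads = 6;
        constexpr int n_iterations = 100'000;
        constexpr double requested_us = 100.0; // nominal sleep time
        constexpr double jitter_us = 10.0;     // symmetric jitter (standard deviation)

        std::mt19937 gen(42);
        std::normal_distribution<double> sleep_time(requested_us, jitter_us);

        double sum_of_maxima = 0.0;
        for (int i = 0; i < n_iterations; ++i) {
            double longest = 0.0;
            for (int t = 0; t < n_threads; ++t) {
                // each simulated thread sleeps for a symmetric random duration
                longest = std::max(longest, sleep_time(gen));
            }
            // the "parallel region" only finishes when the slowest thread does
            sum_of_maxima += longest;
        }

        // prints a value noticeably above requested_us (roughly 113 us for these parameters)
        std::cout << "mean of per-iteration maximum: "
                  << sum_of_maxima / n_iterations << " us\n";
    }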