I've distilled my problem down to its bare essentials. Here is the first example:
#include <vector>
#include <math.h>
#include <thread>

std::vector<double> vec(10000);

void run(void)
{
    for (int l = 0; l < 500000; l++) {
#pragma omp parallel for
        for (int idx = 0; idx < vec.size(); idx++) {
            vec[idx] += cos(idx);
        }
    }
}

int main(void)
{
    // This empty parallel region should do nothing, yet it triggers the slowdown described below.
#pragma omp parallel
    {
    }
    std::thread threaded_call(&run);
    threaded_call.join();
    return 0;
}
Compile this (on Ubuntu 20.04) with: g++ -fopenmp main.cpp -o main
EDIT: Version: g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Running on a Ryzen 3700X (8 cores, 16 threads): run time ~43 s, with all 16 logical cores reported in System Monitor at ~80%.
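For anyone wanting to reproduce the timing, here is a drop-in replacement for main() from the example above that measures the elapsed wall-clock time with std::chrono (my own sketch, not part of the original code):

#include <chrono>
#include <cstdio>
#include <thread>

void run(void);   // run() as defined in the example above

int main(void)
{
    auto start = std::chrono::steady_clock::now();

#pragma omp parallel   // the "bogus" empty parallel region under discussion
    {
    }

    std::thread threaded_call(&run);
    threaded_call.join();

    auto stop = std::chrono::steady_clock::now();
    std::printf("run() took %.1f s\n",
                std::chrono::duration<double>(stop - start).count());
    return 0;
}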
Next, take out the #pragma omp parallel directive, so that the main function becomes:
int main(void)
{
    std::thread threaded_call(&run);
    threaded_call.join();
    return 0;
}
Now the run time is ~9 s, with all 16 logical cores reported in System Monitor at 100%.
I've also compiled this with MSVC on Windows 10; there, CPU utilization is always ~100%, irrespective of whether the #pragma omp parallel directive is present. Yes, I am fully aware that this line should do absolutely nothing, yet with g++ it causes the behaviour above, and only when run is called on a std::thread, not directly. I experimented with various compilation flags (-O levels) but the problem remains. I suppose looking at the assembly is the next step, but I can't see how this is anything but a bug in g++. Can anyone shed some light on this? It would be much appreciated.
Furthermore, calling omp_set_num_threads(1); in run() just before the loop (placement shown below), in order to check how long a single thread takes, gives a ~70 s run time with only one core at 100% (as expected).
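For clarity, this is where that call was placed (my own annotated copy of run() from the example above; it additionally needs #include <omp.h>):

#include <omp.h>

void run(void)
{
    // Force subsequent parallel regions encountered by this thread to use a
    // single-thread team, so the loop below effectively runs serially.
    omp_set_num_threads(1);

    for (int l = 0; l < 500000; l++) {
#pragma omp parallel for
        for (int idx = 0; idx < vec.size(); idx++) {
            vec[idx] += cos(idx);
        }
    }
}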
A further, possibly related problem (although this might be a lack of understanding on my part): calling omp_set_num_threads(1); in main (before threaded_call is created) does nothing when compiling with g++, i.e. all 16 threads still execute the for loop, irrespective of the bogus #pragma omp parallel directive being there or not. When compiling with MSVC this causes only one thread to run, as expected; according to the documentation for omp_set_num_threads I thought this should be the correct behaviour, but not so with g++. Why not? Is this a further bug?
EDIT: I understand this last problem now (see Overriding OMP_NUM_THREADS from code - for real), but it still leaves the original problem outstanding.
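My reading of that answer, as it applies here: omp_set_num_threads() only changes the nthreads-var ICV of the calling thread, and a std::thread that later touches OpenMP starts from the defaults taken from the environment, so the call made in main never reaches the team that run() creates. A small illustration (my own sketch, compiled with g++ -fopenmp; the exact numbers printed depend on the machine and the OpenMP runtime):

#include <cstdio>
#include <thread>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(1);   // affects only the calling (main) thread's ICV

    std::thread t([]() {
        // This thread has not inherited main's setting: with libgomp it still
        // reports the default team size here (16 on the Ryzen 3700X).
        std::printf("worker max threads: %d\n", omp_get_max_threads());

        omp_set_num_threads(1);   // this is the call that actually limits this thread's team
        std::printf("worker max threads now: %d\n", omp_get_max_threads());
    });
    t.join();

    std::printf("main max threads: %d\n", omp_get_max_threads());   // 1
    return 0;
}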
Thank you to Hristo Iliev for the useful comments. I now understand this and would like to answer my own question in case it is of use to anyone with similar issues.
The problem is that if any OpenMP code is executed in the main program thread, its OpenMP state becomes "polluted": after the #pragma omp parallel directive, the main thread's team of OpenMP threads (all 16) remains alive in a busy state. Those threads compete for CPU time with, and thus degrade the performance of, OpenMP code running in any std::thread, each of which spawns its own team of OpenMP threads. Since the main thread only goes out of scope when the program finishes, this performance issue persists for the entire program execution. So when mixing OpenMP with std::thread, make sure absolutely no OpenMP code runs in the main program thread.
To demonstrate this, consider the following modified example:
#include <vector>
#include <math.h>
#include <thread>
#include <chrono>

std::vector<double> vec(10000);

void run(void)
{
    for (int l = 0; l < 500000; l++) {
#pragma omp parallel for
        for (int idx = 0; idx < vec.size(); idx++) {
            vec[idx] += cos(idx);
        }
    }
}

void state(void)
{
    // Empty parallel region: spawns a team of OpenMP threads tied to this
    // std::thread; they remain in a busy state while this thread sleeps.
#pragma omp parallel
    {
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(5000));
}

int main(void)
{
    std::thread state_thread(&state);
    state_thread.detach();
    std::thread threaded_call(&run);
    threaded_call.join();
    return 0;
}
This code runs at 80% CPU utilization for the first 5 seconds and then at 100% for the rest of the program. The reason is that the first std::thread spawns a team of 16 OpenMP threads that remain in a busy state, so they compete for CPU time with the OpenMP code in the second std::thread. As soon as the first std::thread terminates, the second std::thread is no longer affected, since its own team of 16 OpenMP threads no longer has to compete for CPU access with the first. When the offending code was in the main thread, the issue persisted until the end of the program.