Core isolation limits OpenMP to a single core

While investigating a performance issue, I realized that an OpenMP-based parallel code running on isolated cores ends up with all of its threads on a single core.

This code should distribute the iterations of the for-loop across N threads, one per core (e.g. OMP_NUM_THREADS=N):

// g++ -fopenmp -I/usr/include/ -O3 test_openmp.cpp -o testomp
#include <immintrin.h>
#include <cmath>
#include <chrono>
#include <iostream>
const float nano = 1000000000;  // nanoseconds per second
int main(){
    std::size_t niter = 10000000000;
    // steady_clock is monotonic, so it is the safer choice for measuring intervals
    auto start = std::chrono::steady_clock::now();
    // Split the loop iterations across the threads of the OpenMP team
    #pragma omp parallel for
    for(std::size_t i = 0; i < niter; i++){
        // Dummy work; the result is intentionally unused
        std::size_t x = std::sqrt(i);
    }
    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start);
    std::cout << "Took " << elapsed.count()/nano << " s" << std::endl;
}

On my dev system, there are 32 cores; 0-15 are isolated, while 16-31 are not. Running the code on the isolated cores gives:

OMP_NUM_THREADS=4 taskset -c 0-3 ./testomp
Took 5.33671 s

On the non-isolated cores, it gives:

OMP_NUM_THREADS=4 taskset -c 16-19 ./testomp
Took 1.33193 s

In addition, htop shows a single core at 100% utilization for the test on the isolated CPUs, while the test on the non-isolated CPUs shows four cores at 100% utilization.
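The htop observation can also be checked from inside the program. The following is a minimal sketch, assuming Linux with glibc (for sched_getcpu); the function name cpus_used_by_threads and the file name in the compile comment are hypothetical and not part of the original test. Each thread of the parallel region does a short busy loop and then records the CPU it is currently running on.

```cpp
// g++ -fopenmp -O2 thread_cpus.cpp   (hypothetical file name)
#include <sched.h>  // sched_getcpu (Linux/glibc)
#include <set>

// Returns the set of CPU ids on which the OpenMP threads were observed.
std::set<int> cpus_used_by_threads() {
    std::set<int> cpus;
    #pragma omp parallel
    {
        // Burn some CPU time so the scheduler has a chance to place the threads.
        volatile double sink = 0.0;
        for (long i = 0; i < 20000000L; ++i) {
            sink = sink + 1.0;
        }
        int cpu = sched_getcpu();  // CPU this thread is running on right now
        // Serialize access to the shared set.
        #pragma omp critical
        cpus.insert(cpu);
    }
    return cpus;
}
```

Calling this from main and printing the set under OMP_NUM_THREADS=4 taskset -c 0-3 should show a single CPU id if the threads are stacked on one core, and several ids if they are spread out. A thread can migrate after the call, so the result is only a snapshot.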

Is there a way to allow OpenMP to use multiple isolated cores in a parallel for-loop? Using the isolated cores would be essential for performance reasons (avoiding kernel tasks, etc.).

If not, how does OpenMP differentiate between isolated and non-isolated cores?

EDIT

I enabled optimization with -O3. Of course, the execution time decreases, but that is not the point (this is not a benchmark). The point is that the code runs on multiple non-isolated cores but on only one isolated core, even though I see four threads running on that isolated core.

I am using the GNU C++ compiler:

~$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

OpenMP version is 4.5

The CPU and its configuration:

~$ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      52 bits physical, 57 bits virtual
CPU(s):                             128
On-line CPU(s) list:                0-63
Off-line CPU(s) list:               64-127
Thread(s) per core:                 1
Core(s) per socket:                 32
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              17
Model name:                         AMD EPYC 9354 32-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            1499.458
CPU max MHz:                        3799.0720
CPU min MHz:                        1500.0000
BogoMIPS:                           6499.69
Virtualization:                     AMD-V
L1d cache:                          2 MiB
L1i cache:                          2 MiB
L2 cache:                           64 MiB
L3 cache:                           512 MiB
NUMA node0 CPU(s):                  0-31

Solution

With GNU and Intel OpenMP, this should run the program on your isolated cores (0-3):

OMP_PLACES="{0}:4" taskset --cpu-list 0-3 command

and this should run it on cores 16-19:

OMP_PLACES="{16}:4" taskset --cpu-list 16-19 command

Some OpenMP implementations, such as GNU libgomp and Intel libiomp, respect the affinity mask of the process, so the mask has to be set with taskset. Why OMP_PLACES (or an equivalent specifier or function) is needed in addition to taskset is beyond my knowledge.
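To see the affinity mask that the OpenMP runtime inherits from taskset, something like the following could be used. It is only a sketch, assuming Linux with glibc (sched_getaffinity and the CPU_* macros); the helper name allowed_cpus is hypothetical.

```cpp
#include <sched.h>  // sched_getaffinity, cpu_set_t, CPU_* macros (Linux/glibc)
#include <vector>

// Returns the CPU ids in the calling thread's affinity mask,
// i.e. the CPUs that taskset left available to the process.
std::vector<int> allowed_cpus() {
    std::vector<int> cpus;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return cpus;  // query failed; report an empty list
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) {
            cpus.push_back(cpu);
        }
    }
    return cpus;
}
```

When the program is started with taskset --cpu-list 0-3, the returned list should contain CPUs 0 to 3. The mask only says where threads are allowed to run, not where each thread actually ends up.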

For more details, see also: Running OpenMP threads on isolated CPUs