Core isolation limits OpenMP to a single core

While investigating a performance issue in my code, I realized that an OpenMP-based parallel program running on isolated cores ends up using only a single core, no matter how many threads it starts.

This code should distribute the for-loop across N cores (e.g. with OMP_NUM_THREADS=N):

// g++ -fopenmp -I/usr/include/ -O3 test_openmp.cpp -o testomp
#include <immintrin.h>
#include <cmath>
#include <chrono>
#include <iostream>
const float nano = 1000000000;
int main(){
    std::size_t niter = 10000000000;
    auto start = std::chrono::system_clock::now();
    #pragma omp parallel for
    for(std::size_t i = 0; i < niter; i++){
        std::size_t x = sqrt(i);
    }
    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start);
    std::cout << "Took " << elapsed.count()/nano << " s" << std::endl;
}

On my dev system, there are 32 cores; 0-15 are isolated, while 16-31 are not. Running the code on the isolated cores gives:

OMP_NUM_THREADS=4 taskset -c 0-3 ./testomp
Took 5.33671 s

On the non-isolated cores:

OMP_NUM_THREADS=4 taskset -c 16-19 ./testomp 
Took 1.33193 s

In addition, htop shows a single core at 100% utilization for the run on the isolated CPUs, whereas the run on the non-isolated CPUs shows four cores at 100%.
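The htop observation can be cross-checked from inside the program. The following is a minimal sketch, not part of the original test: it assumes Linux with glibc (for sched_getcpu), and the function names cpus_used_by_parallel_region and print_cpus are made up for illustration. Each thread of a parallel region does a little work and then records the CPU it is currently running on.

```cpp
// Build (assumed): g++ -fopenmp -O2 where_threads_run.cpp
// Without -fopenmp the pragmas are ignored and the region runs on one thread.
#include <sched.h>   // sched_getcpu (Linux/glibc specific)
#include <cassert>
#include <cstdio>
#include <set>

// Hypothetical helper: runs one parallel region and returns the set of
// CPU ids on which its threads were observed.
std::set<int> cpus_used_by_parallel_region() {
    std::set<int> cpus;
    #pragma omp parallel
    {
        // Burn some time so the scheduler has a chance to place the threads.
        volatile double sink = 0.0;
        for (long i = 0; i < 20000000L; ++i) sink = sink + 1.0;

        int cpu = sched_getcpu();   // returns -1 on failure
        assert(cpu >= 0);

        // std::set is not thread-safe, so serialize the insert.
        #pragma omp critical
        cpus.insert(cpu);
    }
    return cpus;
}

// Hypothetical helper: prints the observed CPU ids on one line.
void print_cpus(const std::set<int>& cpus) {
    std::printf("threads were seen on %zu CPU(s):", cpus.size());
    for (int cpu : cpus) std::printf(" %d", cpu);
    std::printf("\n");
}
```

Calling print_cpus(cpus_used_by_parallel_region()) from main and running the binary once under taskset -c 0-3 and once under taskset -c 16-19 (both with OMP_NUM_THREADS=4) should show whether the threads really end up on one CPU or on four. Keep in mind that sched_getcpu only gives a snapshot: a thread may migrate right after the call.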

Is there a way to allow OpenMP to use multiple isolated cores in a parallel for-loop? Using the isolated cores would be essential for performance reasons (avoiding kernel tasks, etc.).

If not, how does OpenMP differentiate between isolated and non-isolated cores?

EDIT

I enabled optimization with -O3. The execution time decreases, of course, but that is not the point (and this is not meant as a benchmark). The point is that the code runs on multiple non-isolated cores but on only one isolated core. However, on that single isolated core, I see four threads running.

I am using the GNU C++ compiler:

~$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

The OpenMP version is 4.5.

The CPU and its configuration:

~$ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      52 bits physical, 57 bits virtual
CPU(s):                             128
On-line CPU(s) list:                0-63
Off-line CPU(s) list:               64-127
Thread(s) per core:                 1
Core(s) per socket:                 32
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              17
Model name:                         AMD EPYC 9354 32-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            1499.458
CPU max MHz:                        3799.0720
CPU min MHz:                        1500.0000
BogoMIPS:                           6499.69
Virtualization:                     AMD-V
L1d cache:                          2 MiB
L1i cache:                          2 MiB
L2 cache:                           64 MiB
L3 cache:                           512 MiB
NUMA node0 CPU(s):                  0-31

Solution

I think you are asking OpenMP to do something that is impossible. OpenMP sits at the bottom of the resource allocation hierarchy and cannot override rules that are enforced higher up. (Indeed, in general, doing so would be a bad thing: imagine an MPI code running four OpenMP processes on the same node and partitioning the logical CPUs among them to avoid interference. If OpenMP ignored that partitioning and ran each process on all logical CPUs, it would over-subscribe the machine!)

Looking at the cpuset man page, we see:

Cpusets are integrated with the sched_setaffinity(2) scheduling affinity mechanism and the mbind(2) and set_mempolicy(2) memory-placement mechanisms in the kernel. Neither of these mechanisms let a process make use of a CPU or memory node that is not allowed by that process's cpuset. If changes to a process's cpuset placement conflict with these other mechanisms, then cpuset placement is enforced even if it means overriding these other mechanisms.

So it is explicitly saying that the cpuset implementation is designed to stop code from escaping, yet that is what you seem to want.
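To see which CPUs the kernel actually allows a process to use, the affinity mask can be queried directly. This is a minimal sketch, assuming Linux with glibc; the function names allowed_cpus and print_allowed_cpus are hypothetical and only serve the illustration. It uses sched_getaffinity(2), the read-side counterpart of the sched_setaffinity(2) call named in the man page excerpt.

```cpp
// Build (assumed): g++ -O2 show_affinity.cpp
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_* macros (Linux/glibc)
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical helper: returns the CPU ids the calling thread may run on.
std::vector<int> allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // A pid of 0 refers to the calling thread.
    int rc = sched_getaffinity(0, sizeof(mask), &mask);
    assert(rc == 0 && "sched_getaffinity failed");

    std::vector<int> cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) cpus.push_back(cpu);
    }
    return cpus;
}

// Hypothetical helper: prints the allowed CPU ids on one line.
void print_allowed_cpus() {
    std::printf("allowed CPUs:");
    for (int cpu : allowed_cpus()) std::printf(" %d", cpu);
    std::printf("\n");
}
```

Calling print_allowed_cpus() at the start of main, with the program launched through taskset as in the question, shows the mask that the OpenMP runtime inherits. Whatever the runtime does afterwards, its threads cannot run on a CPU outside that mask.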

Therefore I think this has nothing much to do with OpenMP, which is merely obeying the rules of the Linux system on which it is running.