After investigating a performance issue in some code, I realized that an OpenMP-based parallel program running on isolated cores ends up with all of its threads on a single core.
This code should distribute the for-loop across N cores (e.g. with OMP_NUM_THREADS=N):
// g++ -fopenmp -I/usr/include/ -O3 test_openmp.cpp -o testomp
#include <immintrin.h>
#include <cmath>
#include <chrono>
#include <iostream>
const float nano = 1000000000;
int main(){
    std::size_t niter = 10000000000;
    auto start = std::chrono::system_clock::now();
    #pragma omp parallel for
    for(std::size_t i = 0; i < niter; i++){
        std::size_t x = sqrt(i);
    }
    auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::system_clock::now() - start);
    std::cout << "Took " << elapsed.count()/nano << " s" << std::endl;
}
On my dev system, there are 32 cores; 0-15 are isolated, while 16-31 are not. Running the code on the isolated cores gives:
OMP_NUM_THREADS=4 taskset -c 0-3 ./testomp
Took 5.33671 s
On the non-isolated cores:
OMP_NUM_THREADS=4 taskset -c 16-19 ./testomp
Took 1.33193 s
In addition, htop shows a single core at 100% utilization for the test on the isolated CPUs, whereas the test on the non-isolated CPUs shows four cores at 100% utilization.
Is there a way to allow OpenMP to use multiple isolated cores in a parallel for-loop? Using the isolated cores would be essential for performance reasons (avoiding kernel tasks, etc.).
If not, how does OpenMP differentiate between isolated and non-isolated cores?
EDIT
I enabled optimization with -O3. Of course, the execution time decreases, but that is not the point (nor is benchmarking). The point is that the code runs on multiple non-isolated cores but on only one isolated core. However, on that isolated core, I see four threads running.
I am using the GNU C++ compiler:
~$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
The OpenMP version is 4.5.
The CPU and its configuration:
~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 128
On-line CPU(s) list: 0-63
Off-line CPU(s) list: 64-127
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9354 32-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 1499.458
CPU max MHz: 3799.0720
CPU min MHz: 1500.0000
BogoMIPS: 6499.69
Virtualization: AMD-V
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-31
With GNU and Intel OpenMP, the following should run the program on your isolated cores (0-3):
OMP_PLACES="{0}:4" taskset --cpu-list 0-3 command
and this should run it on cores 16-19:
OMP_PLACES="{16}:4" taskset --cpu-list 16-19 command
Some OpenMP implementations, such as GNU libgomp and Intel libiomp, respect the affinity mask of the process, hence it has to be modified using taskset. Why OMP_PLACES (or equivalent specifiers and functions) is needed in addition to taskset is beyond my knowledge.
For more details, see also: Running OpenMP threads on isolated CPUs