I'm building a NUMA-aware processor that binds to a given socket and accepts lambdas. Here is what I've done:
#include <numa.h>
#include <sched.h>   // sched_getcpu
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>    // std::to_string
#include <thread>
#include <vector>

using namespace std;

unsigned nodes = numa_num_configured_nodes();
unsigned cores = numa_num_configured_cpus();
unsigned cores_per_node = cores / nodes;

int main(int argc, char* argv[]) {
    static char omp_places[] = "OMP_PLACES=sockets(1)";
    putenv(omp_places);                          // putenv needs a writable string
    cout << numa_available() << endl;            // returns 0
    numa_set_interleave_mask(numa_all_nodes_ptr);
    int size = 200000000;
    for (unsigned i = 0; i < nodes; ++i) {
        auto t = thread([&]() {
            // binding this thread (and its allocations) to the given socket
            auto* node_mask = numa_parse_nodestring(to_string(i).c_str());
            numa_bind(node_mask);
            numa_bitmask_free(node_mask);
            vector<int> v(size, 0);
            cout << "node #" << i << ": on CPU " << sched_getcpu() << endl;
            #pragma omp parallel for num_threads(cores_per_node) proc_bind(master)
            for (auto i = 0; i < size; ++i) {
                for (auto j = 0; j < 10; ++j) {
                    v[i]++;
                    v[i] *= v[i];
                    v[i] *= v[i];
                }
            }
        });
        t.join();
    }
}
However, all threads are running on socket 0. It seems numa_bind doesn't bind the current thread to the given socket. The worker for the second NUMA node (node 1) outputs "node #1: on CPU 0", but it should be on a CPU belonging to socket 1. So what's going wrong?
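For reference, one way to see where a thread is actually allowed to run is to dump its affinity mask right after the numa_bind call. The helper below is only a minimal diagnostic sketch (the name print_affinity is made up; it relies on Linux's sched_getaffinity and libnuma's numa_node_of_cpu):

#include <numa.h>
#include <sched.h>
#include <iostream>

// Print the CPUs the calling thread may run on and the NUMA node of each.
void print_affinity() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {   // pid 0 = calling thread
        std::cerr << "sched_getaffinity failed" << std::endl;
        return;
    }
    for (int cpu = 0; cpu < numa_num_configured_cpus(); ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            std::cout << "allowed CPU " << cpu
                      << " (node " << numa_node_of_cpu(cpu) << ")" << std::endl;
        }
    }
}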
This works for me exactly as I expected:
#include <cassert>
#include <iostream>
#include <numa.h>
#include <omp.h>
#include <sched.h>

int main() {
    assert(numa_available() != -1);
    auto nodes = numa_num_configured_nodes();
    auto cores = numa_num_configured_cpus();
    auto cores_per_node = cores / nodes;

    omp_set_nested(1);  // allow nested parallel regions
    #pragma omp parallel num_threads(nodes)
    {
        auto outer_thread_id = omp_get_thread_num();
        // Pin this outer thread to the CPUs of one NUMA node;
        // the inner threads it spawns inherit that affinity.
        numa_run_on_node(outer_thread_id);
        #pragma omp parallel num_threads(cores_per_node)
        {
            auto inner_thread_id = omp_get_thread_num();
            #pragma omp critical
            std::cout
                << "Thread " << outer_thread_id << ":" << inner_thread_id
                << " core: " << sched_getcpu() << std::endl;
            assert(outer_thread_id == numa_node_of_cpu(sched_getcpu()));
        }
    }
}
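(To build it, enable OpenMP and link against libnuma, e.g. g++ -std=c++11 -fopenmp example.cpp -lnuma, where the file name is just a placeholder.)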
The program first creates 2 (outer) threads on my dual-socket server. Then it binds them to different sockets (NUMA nodes). Finally, it splits each outer thread into 20 (inner) threads, since each CPU has 10 physical cores and hyper-threading enabled.
All inner threads run on the same socket as their parent thread: on cores 0-9 and 20-29 for outer thread 0, and on cores 10-19 and 30-39 for outer thread 1. (sched_getcpu() returned the virtual core number in the range 0-39 in my case.)
Note that there is no C++11 threading, just pure OpenMP.
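If the processed data should also live on the socket that works on it, the same pattern can be extended with node-local allocation. This is only a sketch on top of the example above, not part of the original answer; numa_alloc_onnode and numa_free are libnuma calls, and the element count is arbitrary:

#include <cassert>
#include <numa.h>
#include <omp.h>

int main() {
    assert(numa_available() != -1);
    auto nodes = numa_num_configured_nodes();
    auto cores_per_node = numa_num_configured_cpus() / nodes;
    const size_t count = 100000000;      // arbitrary per-node element count

    omp_set_nested(1);
    #pragma omp parallel num_threads(nodes)
    {
        auto node = omp_get_thread_num();
        numa_run_on_node(node);          // run this team member on the node's CPUs

        // Allocate this node's working set from its local memory.
        auto* v = static_cast<int*>(numa_alloc_onnode(count * sizeof(int), node));
        assert(v != nullptr);

        #pragma omp parallel for num_threads(cores_per_node)
        for (size_t i = 0; i < count; ++i) {
            v[i] = 0;                    // first touch also happens on the owning node
        }

        numa_free(v, count * sizeof(int));
    }
}

Because each buffer is both allocated on and first touched from its own node, the inner threads only ever access local memory.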