I've run into a minor HPC problem after running some tests on an 80-core (160 HT) Nehalem architecture with 2 TB DRAM:
A server with more than 2 sockets starts to stall a lot (delay) as each thread starts to request information about objects on the "wrong" socket, i.e. requests go from a thread that is working on some objects on one socket to pull information that is actually in the DRAM on another socket.
The cores appear 100% utilized, even though I know that they are waiting for the remote socket to return the request.
As most of the code runs asynchronously, it is a lot easier to rewrite the code so that I can just pass messages from the threads on one socket to the threads on the other (no locked waiting). In addition, I want to lock each thread to a memory pool, so I can update objects in place instead of wasting time (~30%) on the garbage collector.
Hence the question:
How to pin threads to cores with predetermined memory pool objects in Python?
A little more context:
Python has no problem running multicore when you put ZeroMQ in the middle and make an art out of passing messages between the memory pools managed by each ZMQworker. At ZMQ's 8M msg/second, the internal update of the objects takes longer than filling the pipeline. This is all described here: http://zguide.zeromq.org/page:all#Chapter-Sockets-and-Patterns
So, with a little over-simplification, I spawn 80 ZMQworkerprocesses and 1 ZMQrouter and load the context with a large swarm of objects (584 million objects actually). From this "start-point" the objects need to interact to complete the computation.
This is the idea:
To do this I need to know:
But I cannot find references in the Python docs on how to do this, and on Google I must be searching for the wrong thing.
Update:
Regarding the question "why use ZeroMQ on an MPI architecture?", please read the thread: Spread vs MPI vs zeromq? as the application I am working on is being designed for a distributed deployment even though it is tested on an architecture where MPI is more suitable.
Update 2:
Regarding the question:
"How to pin threads to cores with predetermined memory pools in Python(3)" the answer is in psutil:
>>> import psutil
>>> psutil.cpu_count()
4
>>> p = psutil.Process()
>>> p.cpu_affinity() # get
[0, 1, 2, 3]
>>> p.cpu_affinity([0]) # set; from now on, this process will run on CPU #0 only
>>> p.cpu_affinity()
[0]
>>>
>>> # reset affinity against all CPUs
>>> all_cpus = list(range(psutil.cpu_count()))
>>> p.cpu_affinity(all_cpus)
>>>
The worker can be pegged to a core, so that NUMA can be exploited effectively (look up your CPU type to verify that it is a NUMA architecture!)
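The same pinning can also be sketched with the standard library alone. This is a minimal sketch, assuming Linux (where `os.sched_setaffinity` is available); the names `pinned_worker` and `spawn_pinned_workers` are made up for illustration and are not part of the original setup:

```python
import multiprocessing as mp
import os


def pinned_worker(cpu, results):
    """Pin the calling process to one CPU, then report where it may run."""
    # Linux-only call; pid 0 means "the calling process".
    os.sched_setaffinity(0, {cpu})
    results.put((cpu, sorted(os.sched_getaffinity(0))))


def spawn_pinned_workers(n_workers):
    """Start up to n_workers processes, each pinned to a distinct allowed CPU."""
    # Only use CPUs this process may run on (cgroups/containers can restrict them).
    allowed = sorted(os.sched_getaffinity(0))[:n_workers]
    # Assumption: the "fork" start method, which exists on Linux.
    ctx = mp.get_context("fork")
    results = ctx.Queue()
    procs = [ctx.Process(target=pinned_worker, args=(cpu, results)) for cpu in allowed]
    for p in procs:
        p.start()
    # Drain the queue before joining so no worker blocks on a full pipe.
    report = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    return report


if __name__ == "__main__":
    for cpu, affinity in sorted(spawn_pinned_workers(4).items()):
        print(f"worker pinned to CPU {cpu}: affinity is now {affinity}")
```

Note that this only fixes where the worker runs. Where its memory lands still depends on the kernel's NUMA policy, which by default tends to allocate on the node of the CPU that first touches a page.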
The second element is to determine the memory pool. That can be done with psutil as well or the resource library:
You might be underestimating the issue; there is no super-easy way to accomplish what you want. As a general guideline, you need to work at the operating system level to get things set up the way you want. You want to work with so-called "CPU affinity" and "memory affinity", and you need to think hard about your system architecture as well as your software architecture to get things right. In real HPC, these "affinities" are normally handled by an MPI library, such as Open MPI. You might want to consider using one and let your different processes be handled by that MPI library. The interface between operating system, MPI library and Python can be provided by the mpi4py package.
You also need to get your concept of threads and processes and the OS setting straight. While for the CPU time scheduler, a thread is a task to be scheduled and therefore theoretically could have an individual affinity, I am only aware of affinity masks for entire processes, i.e. for all threads within one process. For controlling memory access, NUMA (non-uniform memory access) is the right keyword and you might want to look into http://linuxmanpages.com/man8/numactl.8.php
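To make the numactl hint a bit more concrete, here is a rough sketch of launching one worker per NUMA node from Python, with both CPU and memory bound to that node. It assumes `numactl` is installed; `worker.py` is a hypothetical script name and the helper functions are made up for this example:

```python
import shlex
import subprocess
import sys


def numactl_command(node, worker_argv):
    """Build a numactl invocation binding CPUs and memory to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={node}",  # run only on the CPUs of this node
        f"--membind={node}",      # allocate memory only from this node
        *worker_argv,
    ]


def launch_on_node(node, worker_argv):
    """Start the worker under numactl (requires numactl to be installed)."""
    return subprocess.Popen(numactl_command(node, worker_argv))


if __name__ == "__main__":
    # "worker.py" is a hypothetical script standing in for one ZMQ worker.
    cmd = numactl_command(0, [sys.executable, "worker.py"])
    print(shlex.join(cmd))
```

With `--membind`, allocations fail rather than spill over to another node once the node's memory is exhausted, so it is worth checking that the per-node share of objects actually fits.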
In any case, you need to read articles about the affinity topic and might want to start reading in the Open MPI FAQs about CPU/memory affinity: http://www.open-mpi.de/faq/?category=tuning#paffinity-defs
In case you want to achieve your goal without using an MPI library, look into the packages util-linux or schedutils and numactl of your Linux distribution in order to get useful command-line tools such as taskset, which you could e.g. call from within Python in order to set affinity masks for certain process IDs.
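Calling taskset from Python could look roughly like this. It is only a sketch, assuming the `taskset` binary from util-linux is on the PATH; `taskset_command` and `pin_pid` are illustrative helper names:

```python
import os
import shutil
import subprocess


def taskset_command(pid, cpus):
    """Build a taskset call that restricts an existing PID to the given CPUs."""
    cpu_list = ",".join(str(c) for c in sorted(cpus))
    # -p: operate on an existing PID; -c: CPUs given as a list, not a hex mask.
    return ["taskset", "-cp", cpu_list, str(pid)]


def pin_pid(pid, cpus):
    """Run taskset; raises CalledProcessError if taskset reports a failure."""
    return subprocess.run(
        taskset_command(pid, cpus), check=True, capture_output=True, text=True
    )


if __name__ == "__main__":
    if shutil.which("taskset"):
        # Demonstration only: pin this process to its first allowed CPU (Linux).
        first_cpu = min(os.sched_getaffinity(0))
        print(pin_pid(os.getpid(), [first_cpu]).stdout)
    else:
        print("taskset not found; would run:", taskset_command(os.getpid(), [0]))
```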
This article seems to vividly describe how an MPI library can be helpful with your issue:
http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options/
This SO answer describes how you can dissect your hardware architecture: https://stackoverflow.com/a/11761943/145400
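If you would rather inspect the layout from Python, the kernel exposes the node-to-CPU mapping under sysfs on Linux. The following is a minimal sketch under that assumption; the function names are invented for the example, and on a machine without that sysfs tree it simply returns an empty dict:

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel CPU list such as "0-3,8,10-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def numa_topology(root="/sys/devices/system/node"):
    """Map NUMA node id -> CPU ids, or {} if the sysfs tree is absent."""
    topology = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        node_id = int(re.search(r"node(\d+)$", os.path.dirname(path)).group(1))
        with open(path) as fh:
            topology[node_id] = parse_cpulist(fh.read())
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"node {node}: CPUs {cpus}")
```

The resulting mapping tells you which CPU ids belong together on one socket, which is what you need before handing CPU sets to an affinity call.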
Generally, I am wondering if the machine you are using is the right one for the task, or if you are maybe optimizing at the wrong end. If you are messaging within one machine and hitting memory bandwidth limits, I am not sure if ZMQ (through TCP/IP, right?) is the right tool at all to perform the messaging. Coming back to MPI, the message passing interface for HPC applications...