Tags: mpi, openmpi, mpi4py

"A request was made to bind to that would result in binding more processes than cpus on a resource" mpirun command (for mpi4py)


I am running OpenAI baselines, specifically the Hindsight Experience Replay code. (However, I think this question is independent of that code and is really an MPI question, which is why I'm posting it on Stack Overflow.)

You can see the README there, but the point is that the command to run is:

python -m baselines.her.experiment.train --num_cpu 20

where the number of CPUs can vary and is passed through to MPI.

I am successfully running the HER training script with 1-4 CPUs (i.e., --num_cpu x for x=1,2,3,4) on a single machine with:

  • Ubuntu 16.04
  • Python 3.5.2
  • TensorFlow 1.5.0
  • One TitanX GPU

The number of CPUs seems to be 8, as I have a quad-core Intel i7 processor with hyperthreading, and Python confirms that it sees 8 CPUs.

(py3-tensorflow) daniel@titan:~/baselines$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import os, multiprocessing

In [2]: os.cpu_count()
Out[2]: 8

In [3]: multiprocessing.cpu_count()
Out[3]: 8

Unfortunately, when I run with 5 or more CPUs, I get this message blocking the code from running:

(py3-tensorflow) daniel@titan:~/baselines$ python -m baselines.her.experiment.train --num_cpu 5
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        titan
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

And here's where I got lost. There's no stack trace or offending line of code to fix, so I'm unsure where I would even add overload-allowed in the code.

At a high level, the code takes in this argument and uses Python's subprocess module to run an mpirun command; a rough sketch of that pattern is below. However, checking mpirun --help on the command line doesn't list overload-allowed as a valid argument.

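If it helps, here is roughly what that relaunch looks like (a minimal sketch of the mpi_fork-style pattern; the real baselines helper differs in its details, e.g. it also sets thread-limit environment variables):

import os
import subprocess
import sys

def mpi_fork(n):
    # Re-launch the current script under mpirun with n workers (sketch only).
    if n <= 1:
        return "child"
    if os.environ.get("IN_MPI") is None:
        env = os.environ.copy()
        env.update(IN_MPI="1")  # mark children so they don't re-fork
        args = ["mpirun", "-np", str(n), sys.executable] + sys.argv
        subprocess.check_call(args, env=env)
        return "parent"
    return "child"
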
Googling this error message leads to issues in the Open MPI repository, but I'm not sure whether it's an Open MPI thing or an mpi4py thing.

Here's pip list in my virtual environment if it helps:

(py3.5-mpi-practice) daniel@titan:~$ pip list
decorator (4.2.1)
ipython (6.2.1)
ipython-genutils (0.2.0)
jedi (0.11.1)
line-profiler (2.1.2)
mpi4py (3.0.0)
numpy (1.14.1)
parso (0.1.1)
pexpect (4.4.0)
pickleshare (0.7.4)
pip (9.0.1)
pkg-resources (0.0.0)
pprintpp (0.3.0)
prompt-toolkit (1.0.15)
ptyprocess (0.5.2)
Pygments (2.2.0)
setuptools (20.7.0)
simplegeneric (0.8.1)
six (1.11.0)
traitlets (4.3.2)
wcwidth (0.1.7)

So, TL;DR:

  • How do I fix this error in my code?
  • If I add the "overload-allowed" thing, what happens? Is it safe?

Thanks!


Solution

  • overload-allowed is a qualifier that is passed to the --bind-to parameter of mpirun (source).

    mpirun ... --bind-to core:overload-allowed
    

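    Since the training script assembles the mpirun command itself via subprocess, one place to add the qualifier is wherever that argument list is built. A hedged sketch, assuming an argument list like the one in the question:

      import sys

      n = 5  # hypothetical worker count taken from --num_cpu
      # Add the binding qualifier so Open MPI allows binding more
      # processes than there are physical cores.
      args = (["mpirun", "-np", str(n),
               "--bind-to", "core:overload-allowed",
               sys.executable] + sys.argv)
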
    Beware that hyperthreading is more about marketing than about real performance gains.

    Your i7 actually has four physical cores and four "logical" ones. The logical cores try to use whatever execution resources of the physical cores are currently idle. The problem is that a good HPC program keeps the CPU hardware close to 100% busy, so hyperthreading has no spare resources left to work with.

    So it is safe to "overload" the cores, but it's not your number-one candidate for a performance boost.

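    You can check the physical-versus-logical split yourself; a quick sketch, assuming the third-party psutil package is installed:

      import os
      import psutil  # third-party: pip install psutil

      print(os.cpu_count())                   # logical CPUs: 8 here (4 cores x 2 threads)
      print(psutil.cpu_count(logical=False))  # physical cores: 4 on this i7
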
    Regarding the advice the paper's authors give about reproducing their results: in the best case, fewer CPUs just means slower learning. However, if learning doesn't converge to the expected value no matter how the hyperparameters are tweaked, that is a reason to look more closely at the proposed algorithm.

    While IEEE 754 computations can differ when performed in a different order, that difference should not play a crucial role.
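
    For instance, floating-point addition is not associative, so the reduction order across workers can change the last bits of a result:

      # IEEE 754 addition is not associative: summation order matters.
      a, b, c = 0.1, 0.2, 0.3
      print((a + b) + c)   # 0.6000000000000001
      print(a + (b + c))   # 0.6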