I am trying to run a job on SLURM that uses explicit message passing between nodes (i.e. not just running independent parallel jobs), but I am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code requires sending an array of parameters across 128 processes, calculating a likelihood of those parameters on each, and gathering the sum of those likelihood values back to the root process. I got the error when executing the code with the following sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -N 8 ./linesearch
I thought that using -N 8 would explicitly assign 8 processes per node against the 16 --ntasks-per-node requested above. I expected this approach, although an inefficient use of the allocation, to avoid the error, following a response to a different Stack Overflow thread, but it didn't resolve the issue.
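For what it's worth, one way to check how the ranks are actually being mapped and bound is Open MPI's --report-bindings option (available in the 2.x series); something like the following, using the same launch line as above:

# print each rank's mapping and binding as the job launches
mpiexec --report-bindings -N 8 ./linesearch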
The full error message, if useful, is as follows:
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: NONE:IF-SUPPORTED
Node: XXXXXX
#processes: 4
#cpus: 3
You can override this protection by adding the "overload-allowed"
option to your binding directive.
The processes that I'm executing can be memory intensive, so I don't necessarily want to use the overload override at the risk of jobs terminating after exhausting their memory allocation.
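For reference, I believe the override that the message refers to is a qualifier appended to the binding directive, something like the following (not used here, for the memory reasons above):

# allow more processes than cpus on the binding target (risks oversubscription)
mpiexec --bind-to core:overload-allowed -N 8 ./linesearch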
Note that I was loading the openmpi module v2.0.1 [retired]. However, changing the sbatch file to bind to socket with only -np 128 tasks resolved the issue. The working sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks=128
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
mpiexec -np 128 ./execs/linesearch $1 $2
An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core.
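Applied to the working launch line above, that would look something like this (same executable and arguments; only the binding/mapping flags change):

# bind each of the 128 ranks to its own core
mpiexec -np 128 --bind-to core --map-by core ./execs/linesearch $1 $2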