Tags: mpi, slurm, openmpi, sbatch

"Binding more processes than cpus" error with SLURM and OpenMPI


I am trying to run a job that uses explicit message passing between nodes on SLURM (i.e., not just running independent parallel jobs), but I am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code requires broadcasting an array of parameters across 128 processes, calculating a likelihood of those parameters on each process, and gathering the sum of those likelihood values back to the root node (a sketch of this pattern is shown after the sbatch file below). I got the error when executing the code using the following sbatch file:

#!/bin/bash

#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00

# Load the default OpenMPI module.
module load openmpi

mpiexec -N 8 ./linesearch

I thought that -N 8 would explicitly assign only 8 processes per node against the 16 requested by --ntasks-per-node, and that this undersubscription, while an inefficient use of the allocation, would avoid the error, following a response to a different Stack Overflow thread. It didn't resolve the issue.
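
For context, the communication pattern described above is a standard broadcast-then-reduce. The sketch below is a minimal illustration of it, assuming C with MPI; the actual linesearch source is not shown in the question, and partial_loglik is a hypothetical stand-in for the real likelihood computation.

/* Minimal sketch of the pattern described in the question:
 * root broadcasts a parameter array, every rank computes a partial
 * likelihood, and MPI_Reduce sums the partials back on rank 0. */
#include <mpi.h>
#include <stdio.h>

#define NPARAMS 10

/* Hypothetical placeholder: each rank would evaluate its own shard
 * of the data; this is dummy arithmetic standing in for that work. */
static double partial_loglik(const double *params, int rank, int size)
{
    double ll = 0.0;
    for (int i = 0; i < NPARAMS; i++)
        ll -= (params[i] - rank) * (params[i] - rank) / (double)size;
    return ll;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double params[NPARAMS];
    if (rank == 0)
        for (int i = 0; i < NPARAMS; i++)
            params[i] = 0.1 * i;  /* parameters chosen by the root */

    /* Send the parameter array to every process. */
    MPI_Bcast(params, NPARAMS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank evaluates its share of the likelihood. */
    double local = partial_loglik(params, rank, size);

    /* Gather the sum of the partial likelihoods on the root. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total likelihood = %f\n", total);

    MPI_Finalize();
    return 0;
}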

The full error message, if useful, is as follows:

A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     NONE:IF-SUPPORTED
   Node:        XXXXXX
   #processes:  4
   #cpus:       3

You can override this protection by adding the "overload-allowed"
option to your binding directive.

The processes that I'm executing can be memory intensive, so I don't necessarily want to use the overload override, at the risk of jobs terminating after exhausting their memory allocation.
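
For reference, Open MPI takes that override as a qualifier appended to the binding directive, so the launch line would look roughly like the one below (combining it with the -N 8 launch above is my own assembly); for the memory reasons just given, I would rather avoid it:

mpiexec -N 8 --bind-to core:overload-allowed ./linesearch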


Solution

  • Note that I was loading the openmpi module v2.0.1 [retired]. However, changing the sbatch file to bind to socket and to run only 128 tasks (mpiexec -np 128) resolved the issue.

    sbatch file:

    #!/bin/bash
    
    #SBATCH --job-name=linesearch
    #SBATCH --output=ls_%j.txt
    #SBATCH --nodes=16
    #SBATCH --ntasks=128
    #SBATCH --partition=broadwl
    #SBATCH --mem-per-cpu=2000
    #SBATCH --time=18:00:00
    
    # Load the default OpenMPI module.
    module load openmpi
    
    mpiexec -np 128 ./execs/linesearch $1 $2
    

    An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core, as shown below.
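
    For example, the launch line above would become (the rest of the sbatch file unchanged):

    mpiexec -np 128 --bind-to core --map-by core ./execs/linesearch $1 $2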