How To Run MPI Python Script across multiple nodes on Slurm cluster? Error: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1

I'm running a script on a Slurm cluster that could benefit from parallel processing, so I'm trying to implement MPI. However, it doesn't seem to allow me to run processes on multiple nodes. I don't know if this is normally done automatically, but whenever I set --nodes=2 in the batch file for submission, I get the error message:

"Warning: can't run 1 processes on 2 nodes, setting nnodes to 1."

I've been trying to get it to work with a simple Hello World script, but still run into the above error. I added --oversubscribe to the options when I run the MPI script, but still get this error.

#SBATCH --job-name=a_test
#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --cpu-freq=high
#SBATCH --nodes=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1gb
#SBATCH --mem-bind=verbose,local
#SBATCH --time=01:00:00
#SBATCH --output=out_%x.log

module load python/3.6.2
mpirun -np 4 --oversubscribe python

I still get the expected output, but only after the error message:

"Warning: can't run 1 process on 2 nodes, setting nnodes to 1."

I'm worried that without being able to run on multiple nodes, my actual script will be a lot slower.


  • The reason for the warning is this line:

    #SBATCH --ntasks=1

    where you specify that you will run only 1 MPI process, while also requesting 2 nodes.

    In your case --ntasks sets the number of processes (MPI ranks) to run. Slurm can't spread 1 task over 2 nodes, so it drops the node count to 1 and prints the warning. You then override the task count with the equivalent -np 4 on the mpirun line, which is why you still see the expected output.

    For reference, this is the script I run on my system:

    #SBATCH -C knl 
    #SBATCH -q regular
    #SBATCH -t 00:10:00
    #SBATCH --nodes=2
    module load python3
    srun -n 4 python >& py_${SLURM_JOB_ID}.log
    echo $ELAPSED_TIME

    Performance notes:

    • It's faster to run your code on a single node if possible. Inter-node communication is slower than communication within a node. It may be only slightly slower or much slower, depending on things like the cluster architecture.
    • Consult your cluster's recommended settings. For instance, on mine I should be adding certain Slurm options to this script, specifically -c and cpu_bind= (more here).
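
To tie this back to the question's script, here is a rough sketch of a request where the task count and node count agree, plus a small sanity check before launching. Treat it as an illustration rather than a tested recipe for any particular cluster: hello_mpi.py is a hypothetical script name, layout_ok is a helper made up for this example, and the split of 2 tasks per node is an assumption. SLURM_NTASKS and SLURM_JOB_NUM_NODES are environment variables Slurm sets inside a job; the defaults below only apply when the script is run by hand.

```shell
#!/bin/bash
#SBATCH --job-name=a_test
#SBATCH --nodes=2
#SBATCH --ntasks=4            # total MPI ranks; must be at least --nodes
#SBATCH --ntasks-per-node=2   # assumption: split the 4 ranks evenly over both nodes
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00

# Hypothetical helper: succeeds when there is at least one task per node,
# which is what Slurm needs in order to use every requested node.
layout_ok() {
    [ "$1" -ge "$2" ]
}

# Slurm exports these inside a job; the fallbacks mirror the request above.
ntasks="${SLURM_NTASKS:-4}"
nnodes="${SLURM_JOB_NUM_NODES:-2}"

if layout_ok "$ntasks" "$nnodes"; then
    echo "layout ok: $ntasks tasks on $nnodes nodes"
    # srun takes the task count from the allocation, so no -n is needed here.
    # The launch is skipped when srun is not on PATH (for example, off-cluster).
    if command -v srun >/dev/null 2>&1; then
        srun python hello_mpi.py   # hello_mpi.py is a placeholder name
    fi
else
    echo "layout problem: $ntasks tasks cannot cover $nnodes nodes" >&2
fi
```

With --ntasks=1 and --nodes=2, as in the question, the check takes the else branch, which is the same mismatch the Slurm warning is reporting.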