I am using the cons_tres SLURM plugin, which introduces, among other things, the --gpus-per-task option. If my understanding is correct, the following script should allocate two distinct GPUs on the same node:
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2
#SBATCH --gpus-per-task=1
srun --ntasks=2 --gres=gpu:1 nvidia-smi -L
However, it doesn't, as the output is
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
What gives?
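My expectation was that each task would end up with its own, distinct CUDA_VISIBLE_DEVICES entry set by SLURM. A minimal check of what SLURM actually exports per task (run inside the same allocation; nothing beyond the standard SLURM_PROCID and CUDA_VISIBLE_DEVICES variables is assumed) would be:

srun --ntasks=2 bash -c 'echo "task $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'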
Related: https://stackoverflow.com/a/55029430/10260561
Edit
Alternatively, the srun command could be
srun --ntasks=1 --gres=gpu:1 nvidia-smi -L &
srun --ntasks=1 --gres=gpu:1 nvidia-smi -L &
wait
i.e., run the two tasks in parallel, each on one GPU. This also doesn't work, and gives
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
srun: Job 627 step creation temporarily disabled, retrying
srun: Step created for job 627
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
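The "step creation temporarily disabled" message presumably means the second step had to wait for the first one to release resources, i.e. the first srun grabbed the whole allocation. A variation I have not verified (whether --exclusive, or --exact on SLURM 21.08+, is the right way to split an allocation between concurrent steps depends on the version) would be to restrict each backgrounded step explicitly:

# untested sketch: give each step only 1 task, 4 CPUs and 1 GPU of the allocation
srun --exclusive --ntasks=1 --cpus-per-task=4 --gres=gpu:1 nvidia-smi -L &
srun --exclusive --ntasks=1 --cpus-per-task=4 --gres=gpu:1 nvidia-smi -L &
wait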
Leaving out the extra parameters and calling srun nvidia-smi -L results in
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
GPU 1: Tesla V100-SXM3-32GB (UUID: GPU-ce697126-4112-a696-ff6b-1b072cdf03a2)
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
GPU 1: Tesla V100-SXM3-32GB (UUID: GPU-ce697126-4112-a696-ff6b-1b072cdf03a2)
i.e., are four tasks being run, or does each of the two tasks simply see both GPUs?
I need to run two tasks in parallel on distinct GPUs.
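Presumably the most direct way to express that (and essentially what the batch script above already attempts) would be to pass the option to srun itself; I have not tested this variant separately, so it is listed only to show what I am aiming for:

srun --ntasks=2 --gpus-per-task=1 nvidia-smi -L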
The following, on the other hand, does what I want:
srun --gres=gpu:1 bash -c 'CUDA_VISIBLE_DEVICES=$SLURM_PROCID env' | grep CUDA_VISIBLE
CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=0
but doesn't make use of --gpus-per-task.
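For completeness, this is how I would plug an actual program into that workaround, in place of the srun line in the batch script at the top (my_app is just a placeholder for the real executable, and this assumes the GPU indexing behaves as in the env output above):

# each task manually picks the GPU whose index matches its rank
srun --gres=gpu:1 bash -c 'CUDA_VISIBLE_DEVICES=$SLURM_PROCID my_app'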