
How to run Pytorch script on Slurm?


I am struggling with a basic Python script that uses PyTorch to print the CUDA devices on Slurm.

This is the output of sinfo.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ sinfo -o "%.10P %.5a %.10l %.6D %.6t %.20N %.10G"
 PARTITION AVAIL  TIMELIMIT  NODES  STATE             NODELIST       GRES
  compute*    up 3-00:00:00      1 drain*              scs0123     (null)
  compute*    up 3-00:00:00      1  down*              scs0050     (null)
  compute*    up 3-00:00:00    120  alloc scs[0001-0009,0011-0     (null)
  compute*    up 3-00:00:00      1   down              scs0010     (null)
developmen    up      30:00      1 drain*              scs0123     (null)
developmen    up      30:00      1  down*              scs0050     (null)
developmen    up      30:00    120  alloc scs[0001-0009,0011-0     (null)
developmen    up      30:00      1   down              scs0010     (null)
       gpu    up 2-00:00:00      2    mix       scs[2001-2002] gpu:v100:2
       gpu    up 2-00:00:00      2   idle       scs[2003-2004] gpu:v100:2
  accel_ai    up 2-00:00:00      1    mix              scs2041 gpu:a100:8
  accel_ai    up 2-00:00:00      4   idle       scs[2042-2045] gpu:a100:8
accel_ai_d    up    2:00:00      1    mix              scs2041 gpu:a100:8
accel_ai_d    up    2:00:00      4   idle       scs[2042-2045] gpu:a100:8
accel_ai_m    up   12:00:00      1   idle              scs2046 gpu:1g.5gb
s_highmem_    up 3-00:00:00      1    mix              scs0151     (null)
s_highmem_    up 3-00:00:00      1   idle              scs0152     (null)
s_compute_    up 3-00:00:00      2   idle       scs[3001,3003]     (null)
s_compute_    up    1:00:00      2   idle       scs[3001,3003]     (null)
s_gpu_eng    up 2-00:00:00      1   idle              scs2021 gpu:v100:4

I have access to the accel_ai partition.

This is the Python file I am trying to run.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ cat gpu.py 
import torch
print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")

try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:
    print('Current Devices: Torch is not compiled for GPU or No GPU')

print(f"No. of GPUs: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"GPU Name:{torch.cuda.get_device_name(0)}")

And this is my bash file to submit the job.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ cat check_gpu.sh 
#!bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --mem-per-cpu=10
#SBATCH --gres=gpu:1
#SBATCH --account=scs2045
#SBATCH --partition=accel_ai

module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py

This is what happens when I run the bash script to submit the job.

(ml) [s.1915438@sl2 pytorch_gpu_check]$ bash check_gpu.sh 
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0

One thing I would like to make clear is that this PyTorch version was installed with CUDA 11.3 support, following PyTorch's website.

Can anyone tell me what I am doing wrong? Also, even if I exclude these lines, the output is the same.

module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml

Solution

  • There are a couple of blunders in my approach. In the job file, the first line should be #!/bin/bash, not #!bin/bash.

    Also, Slurm has a dedicated command, sbatch, to submit job files. So in order to run a job file, for example check_gpu.sh, we should use sbatch check_gpu.sh, not bash check_gpu.sh.

    The reason I was getting the following output is that bash treats every line beginning with # as a comment, so all the #SBATCH directives (including --gres=gpu:1, which requests the GPU) were ignored and the script simply ran on the login node, where no GPU is visible.

    (ml) [s.1915438@sl2 pytorch_gpu_check]$ bash check_gpu.sh 
    1.11.0
    Is available: False
    Current Devices: Torch is not compiled for GPU or No GPU
    No. of GPUs: 0
    

    Thus, only the following lines are executed from the job script.

    module load CUDA/11.3
    module load anaconda/3
    source activate
    conda activate ml
    python gpu.py
    

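    The effect is easy to reproduce outside Slurm. The sketch below is not the actual job script; demo.sh is a throwaway name used only for illustration:

```shell
# Write a throwaway script whose directive mimics an #SBATCH line.
cat > demo.sh <<'EOF'
#!/bin/bash
#SBATCH --gres=gpu:1
echo "directive above was ignored by bash"
EOF

# Running it with bash skips every line starting with '#', directives
# included, so nothing requests a GPU and the remaining commands run
# on the current node.
bash demo.sh
# prints: directive above was ignored by bash
```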
    After these corrections, I submitted the job script with sbatch and it works as expected.
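    For reference, the corrected check_gpu.sh is identical to the original except for its first line (all directives are unchanged from the question):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --mem-per-cpu=10
#SBATCH --gres=gpu:1
#SBATCH --account=scs2045
#SBATCH --partition=accel_ai

module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py
```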

    [s.1915438@sl1 pytorch_gpu_check]$ sbatch check_gpu.sh 
    Submitted batch job 7133028
    [s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133029.out 
    1.11.0
    Is available: True
    Current Devices: 0
    No. of GPUs: 1
    GPU Name:NVIDIA A100-PCIE-40GB
    [s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133029.err