I am struggling with a basic python script that uses Pytorch to print the CUDA devices on Slurm.
This is the output of sinfo.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ sinfo -o "%.10P %.5a %.10l %.6D %.6t %.20N %.10G"
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST GRES
compute* up 3-00:00:00 1 drain* scs0123 (null)
compute* up 3-00:00:00 1 down* scs0050 (null)
compute* up 3-00:00:00 120 alloc scs[0001-0009,0011-0 (null)
compute* up 3-00:00:00 1 down scs0010 (null)
developmen up 30:00 1 drain* scs0123 (null)
developmen up 30:00 1 down* scs0050 (null)
developmen up 30:00 120 alloc scs[0001-0009,0011-0 (null)
developmen up 30:00 1 down scs0010 (null)
gpu up 2-00:00:00 2 mix scs[2001-2002] gpu:v100:2
gpu up 2-00:00:00 2 idle scs[2003-2004] gpu:v100:2
accel_ai up 2-00:00:00 1 mix scs2041 gpu:a100:8
accel_ai up 2-00:00:00 4 idle scs[2042-2045] gpu:a100:8
accel_ai_d up 2:00:00 1 mix scs2041 gpu:a100:8
accel_ai_d up 2:00:00 4 idle scs[2042-2045] gpu:a100:8
accel_ai_m up 12:00:00 1 idle scs2046 gpu:1g.5gb
s_highmem_ up 3-00:00:00 1 mix scs0151 (null)
s_highmem_ up 3-00:00:00 1 idle scs0152 (null)
s_compute_ up 3-00:00:00 2 idle scs[3001,3003] (null)
s_compute_ up 1:00:00 2 idle scs[3001,3003] (null)
s_gpu_eng up 2-00:00:00 1 idle scs2021 gpu:v100:4
I have access to the accel_ai partition.
This is the Python file I am trying to run.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ cat gpu.py
import torch

print(torch.__version__)
print(f"Is available: {torch.cuda.is_available()}")
try:
    print(f"Current Devices: {torch.cuda.current_device()}")
except Exception:
    print('Current Devices: Torch is not compiled for GPU or No GPU')
print(f"No. of GPUs: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"GPU Name:{torch.cuda.get_device_name(0)}")
And this is my bash file to submit the job.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ cat check_gpu.sh
#!bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:00:40
#SBATCH --ntasks=1
#SBATCH --job-name=gpu
#SBATCH --output=gpu.%j.out
#SBATCH --error=gpu.%j.err
#SBATCH --mem-per-cpu=10
#SBATCH --gres=gpu:1
#SBATCH --account=scs2045
#SBATCH --partition=accel_ai
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py
This is what happens when I run the bash script to submit the job.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ bash check_gpu.sh
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0
One thing I would like to make clear is that this PyTorch build comes with CUDA 11.3 support, installed from PyTorch's website.
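As a sanity check, torch.version.cuda reports the CUDA version the installed wheel was built against (run inside the activated ml environment; it prints None for a CPU-only build):
python -c "import torch; print(torch.version.cuda)"    # should print 11.3 for this install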
Can anyone tell me what I am doing wrong? Also, even if I exclude these lines, the output is the same.
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
There are a couple of blunders in my approach. In the job file, the first line should be #!/bin/bash, not #!bin/bash.
Also, Slurm has a dedicated command, sbatch, to submit job files. So to run the job file, for example check_gpu.sh, we should use sbatch check_gpu.sh, not bash check_gpu.sh.
The reason I was getting the following output is that bash treats every line starting with # as a comment, so all of the #SBATCH directives were ignored and the script simply ran on the login node, with no GPU allocated.
(ml) [s.1915438@sl2 pytorch_gpu_check]$ bash check_gpu.sh
1.11.0
Is available: False
Current Devices: Torch is not compiled for GPU or No GPU
No. of GPUs: 0
Thus, only the following lines are executed from the job script.
module load CUDA/11.3
module load anaconda/3
source activate
conda activate ml
python gpu.py
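To make the contrast concrete, here is a minimal sketch of the two invocations (comments are mine):
bash check_gpu.sh      # executes immediately on the login node; the #SBATCH lines are mere comments, so no GPU is allocated
sbatch check_gpu.sh    # Slurm parses the #SBATCH directives, queues the job, and runs it on an allocated GPU node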
After these corrections, I submitted the job script with sbatch and it works as expected.
[s.1915438@sl1 pytorch_gpu_check]$ sbatch check_gpu.sh
Submitted batch job 7133028
[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133029.out
1.11.0
Is available: True
Current Devices: 0
No. of GPUs: 1
GPU Name:NVIDIA A100-PCIE-40GB
[s.1915438@sl1 pytorch_gpu_check]$ cat gpu.7133029.err
The error file is empty, so the job ran without errors.
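As an extra sanity check, assuming the compute node exposes nvidia-smi and that Slurm sets CUDA_VISIBLE_DEVICES for GPU allocations (the default on most clusters), two lines appended to the job script confirm the allocation:
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"   # e.g. 0 when --gres=gpu:1 is granted
nvidia-smi -L                                       # lists the GPU(s) visible to the job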