I am new to finding my way around multi-node data centers, and the following is happening to me.
First, I use the program from this answer to check for CUDA devices. I built it (I had some problems there, but that is a matter for another question) and the executable is called device_info8.
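For reference, the program is essentially a cudaGetDeviceCount / cudaGetDeviceProperties query loop; a minimal sketch of that kind of program (not necessarily identical to the one in the linked answer) looks like this:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int nDevices = 0;
    // cudaGetDeviceCount reports 0 devices (and an error code) when the
    // CUDA runtime cannot see any GPU.
    cudaError_t err = cudaGetDeviceCount(&nDevices);
    if (err != cudaSuccess)
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    printf("Number of devices: %d\n", nDevices);

    for (int i = 0; i < nDevices; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device Number: %d\n", i);
        printf("Device name: %s\n", prop.name);
        // memoryClockRate is reported by the runtime in kHz
        printf("Memory Clock Rate (MHz): %d\n", prop.memoryClockRate / 1000);
        printf("Memory Bus Width (bits): %d\n", prop.memoryBusWidth);
        printf("Peak Memory Bandwidth (GB/s): %.1f\n",
               2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
        printf("Total global memory (Gbytes) %.1f\n",
               (double)prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("Shared memory per block (Kbytes) %.1f\n",
               (double)prop.sharedMemPerBlock / 1024.0);
        printf("minor-major: %d-%d\n", prop.minor, prop.major);
        printf("Warp-size: %d\n", prop.warpSize);
        printf("Concurrent kernels: %s\n", prop.concurrentKernels ? "yes" : "no");
        printf("Concurrent computation/communication: %s\n",
               prop.deviceOverlap ? "yes" : "no");
    }
    return 0;
}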
So I log in to my data center and, from the login node, run the executable:
[me@login01 test]$ ./device_info8
Number of devices: 1
Device Number: 0
Device name: Tesla V100-PCIE-16GB
Memory Clock Rate (MHz): 856
Memory Bus Width (bits): 4096
Peak Memory Bandwidth (GB/s): 898.0
Total global memory (Gbytes) 15.8
Shared memory per block (Kbytes) 48.0
minor-major: 0-7
Warp-size: 32
Concurrent kernels: yes
Concurrent computation/communication: yes
I don't have direct access to the node I want to test, so I do
[me@login01 test]$ srun -p partition1 --nodelist Node-11 --gres=gpu:all --pty -u bash -i
[me@Node-11 test]$
and now I run
[me@Node-11 test]$ ./device_info8
Number of devices: 0
However, when I run nvidia-smi, I can clearly see that I have 8 GPUs available:
[me@Node-11 test]$ nvidia-smi
Tue Dec  3 18:16:04 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:2D:00.0 Off |                    0 |
| N/A   28C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:31:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:35:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   27C    P0    24W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:A9:00.0 Off |                    0 |
| N/A   26C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:AD:00.0 Off |                    0 |
| N/A   29C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  On   | 00000000:B1:00.0 Off |                    0 |
| N/A   27C    P0    24W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  On   | 00000000:B5:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Why is this happening and what am I overlooking? How can I make the GPUs available to the program?
The Slurm documentation does not mention the possibility of writing --gres=gpu:all, and when I try it on my system, I get an error. Try specifying an actual number instead of all, and look at the value of the CUDA_VISIBLE_DEVICES variable. It should not be empty; if it is, it means that Slurm has not understood or honoured the request for GPUs.
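For example, to explicitly request all eight GPUs of that node (partition and node names taken from the question) and then check what the job actually received:

[me@login01 test]$ srun -p partition1 --nodelist Node-11 --gres=gpu:8 --pty -u bash -i
[me@Node-11 test]$ echo $CUDA_VISIBLE_DEVICES
[me@Node-11 test]$ ./device_info8

If the allocation worked, the echo should print something like 0,1,2,3,4,5,6,7 (the exact form depends on the Slurm configuration), and device_info8 should then report all eight devices.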