My nvidia-smi output is as follows:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 Ti Off| 00000000:02:00.0 Off | N/A |
| 20% 54C P2 83W / 250W| 4692MiB / 11264MiB | 45% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce GTX 1080 Ti Off| 00000000:03:00.0 Off | N/A |
| 26% 60C P2 73W / 250W| 4650MiB / 11264MiB | 44% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce GTX 1080 Ti Off| 00000000:81:00.0 Off | N/A |
| 50% 71C P0 84W / 250W| 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce GTX 1080 Ti Off| 00000000:82:00.0 Off | N/A |
| 30% 53C P0 75W / 250W| 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3494144 C python 4690MiB |
| 1 N/A N/A 3494896 C python 4648MiB |
+---------------------------------------------------------------------------------------+
I'm running a script to train a RoBERTa model from scratch (based on this article and this notebook), but when I run CUDA_VISIBLE_DEVICES=2,3 python script.py
(this is a machine where other researchers run their scripts; killing the processes on GPUs 0 and 1 is not an option), I get the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB (GPU 0; 10.91 GiB total capacity; 8.36 GiB already allocated; 1.93 GiB free; 8.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Why is only one GPU's memory being recognized (as seen in the 10.91 GiB total capacity)? By selecting more than one GPU, shouldn't I be able to use the total memory they make available? I would like to use that space because it would allow a larger batch size for training. Because of time constraints, I don't intend to use a smaller training batch size.
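For reference, a quick check along these lines (a sketch, not part of my training script) is what I would use to confirm what PyTorch sees after setting CUDA_VISIBLE_DEVICES:

import torch

# With CUDA_VISIBLE_DEVICES=2,3 the two free cards are renumbered as 0 and 1
print("visible GPUs:", torch.cuda.device_count())   # expected: 2
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # Each card reports only its own total memory
    print(i, props.name, round(props.total_memory / 1024**3, 2), "GiB")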
The batch size that you set in torch is the batch size used by each single GPU. Multi-GPU training distributes the batches across the GPUs to speed up each epoch; the gradients computed on each GPU are then combined to update the single resulting model. So you can't use a bigger batch size just because the training employs more GPUs: each GPU still has to hold a full copy of the model plus the activations for its own batch, so the ~11 GiB of a single card remains the memory limit.
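For context, here is a minimal sketch (assuming the Hugging Face Trainer, as in the notebook linked in the question; the names and values are illustrative) of how the per-device batch size relates to the number of visible GPUs:

import torch
from transformers import TrainingArguments

# per_device_train_batch_size is the batch that EACH visible GPU processes
args = TrainingArguments(
    output_dir="out",                # hypothetical output directory
    per_device_train_batch_size=16,  # illustrative value
)

n_gpus = max(torch.cuda.device_count(), 1)   # 2 with CUDA_VISIBLE_DEVICES=2,3
# The effective (global) batch size grows with the number of GPUs...
print("effective batch size:", args.per_device_train_batch_size * n_gpus)
# ...but each GPU must still fit a full model replica plus the activations
# for its own 16 samples, so one card's ~11 GiB remains the limit.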