Tags: python, pytorch, huggingface-transformers

CUDA out of memory when training is done on multiple GPUs


My nvidia-smi output is as follows:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti      Off| 00000000:02:00.0 Off |                  N/A |
| 20%   54C    P2               83W / 250W|   4692MiB / 11264MiB |     45%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce GTX 1080 Ti      Off| 00000000:03:00.0 Off |                  N/A |
| 26%   60C    P2               73W / 250W|   4650MiB / 11264MiB |     44%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce GTX 1080 Ti      Off| 00000000:81:00.0 Off |                  N/A |
| 50%   71C    P0               84W / 250W|      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce GTX 1080 Ti      Off| 00000000:82:00.0 Off |                  N/A |
| 30%   53C    P0               75W / 250W|      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3494144      C   python                                     4690MiB |
|    1   N/A  N/A   3494896      C   python                                     4648MiB |
+---------------------------------------------------------------------------------------+

I'm running a script to train a RoBERTa model from scratch (based on this article and this notebook), but when I run CUDA_VISIBLE_DEVICES=2,3 python script.py (this is a machine where other researchers run their scripts; killing the processes on GPUs 0 and 1 is not an option), I get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB (GPU 0; 10.91 GiB total capacity; 8.36 GiB already allocated; 1.93 GiB free; 8.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Why is only one GPU's memory being recognized (as indicated by the 10.91 GiB total capacity)? By selecting more than one GPU, shouldn't I be able to use the total memory they make available? I would like to use that space, since it would allow me to train with a larger batch size. Because of time constraints, I don't intend to lower the training batch size.
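
A minimal check of what the process actually sees when launched with CUDA_VISIBLE_DEVICES=2,3 (this is not part of the training script, just standard torch.cuda calls; check_devices.py is a placeholder name) would be:

# Run as: CUDA_VISIBLE_DEVICES=2,3 python check_devices.py
import torch

# With CUDA_VISIBLE_DEVICES=2,3 the two physical cards are re-indexed
# inside this process as cuda:0 and cuda:1.
print(torch.cuda.device_count())  # expected: 2

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # total_memory is reported per device; capacities are not pooled
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1024**3:.2f} GiB")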


Solution

  • The batch size you set in torch is the batch size used by each individual GPU. Multi-GPU (data-parallel) training speeds up each epoch by giving every GPU its own batch to process; the gradients computed on each GPU are then combined into a single model update.
    So you can't use a bigger batch size just because the training employs more GPUs: each GPU still has to fit its own batch in its own memory, as the sketch below illustrates.
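
As an illustration, here is a minimal sketch assuming the script uses the Hugging Face Trainer, as the referenced notebook does; the output_dir and batch size values are placeholders:

import torch
from transformers import TrainingArguments

n_gpus = torch.cuda.device_count()   # 2 when launched with CUDA_VISIBLE_DEVICES=2,3

args = TrainingArguments(
    output_dir="out",                # placeholder
    per_device_train_batch_size=16,  # each GPU must fit this many samples
                                     # within its own ~11 GiB of memory
)

# The effective batch per optimizer step grows with the number of GPUs,
# but the memory pressure on any single GPU is set by the per-device value.
print(args.per_device_train_batch_size * max(1, n_gpus))

In other words, using both GPU 2 and GPU 3 lets each optimizer step process more samples in total, but every individual GPU still has to hold its own per-device batch, so the out-of-memory error has to be resolved within a single 11 GiB card.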