Tags: python, pytorch, google-colaboratory, yolo

CUDA: Out of memory error on 128 images dataset


I'm trying to train YOLOR on the coco128 dataset in Google Colab. The training set contains 112 images, the validation set contains 8 images, and the testing set contains 8 images.

But it throws a CUDA out-of-memory error. How can that be? The dataset has only 128 images in total.

Using torch 1.7.0 CUDA:0 (Tesla T4, 15109MB)
Namespace(adam=False, batch_size=8, bucket='', cache_images=False, cfg='cfg/yolor_p6.cfg', data='data/coco128.yaml', device='0', epochs=300, evolve=False, exist_ok=False, global_rank=-1, hyp='./data/hyp.scratch.1280.yaml', image_weights=False, img_size=[1280, 1280], local_rank=-1, log_imgs=16, multi_scale=False, name='yolor_p6', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/yolor_p613', single_cls=False, sync_bn=False, total_batch_size=8, weights='', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
2021-07-29 13:35:48.259076: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.5, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Model Summary: 665 layers, 37265016 parameters, 37265016 gradients, 81.564040600 GFLOPS
Optimizer groups: 145 .bias, 145 conv.weight, 149 other
Scanning labels ../coco128/train2017.cache3 (110 found, 0 missing, 2 empty, 0 duplicate, for 112 images): 112it [00:00, 11214.18it/s]
Scanning labels ../coco128/val2017.cache3 (8 found, 0 missing, 0 empty, 0 duplicate, for 8 images): 8it [00:00, 4100.00it/s]
NumExpr defaulting to 2 threads.
Image sizes 1280 train, 1280 test
Using 2 dataloader workers
Logging results to runs/train/yolor_p613
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0% 0/14 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 539, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 289, in train
    pred = model(imgs)  # forward
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/drive/MyDrive/YOLOR/yolor/models/models.py", line 543, in forward
    return self.forward_once(x)
  File "/content/drive/MyDrive/YOLOR/yolor/models/models.py", line 604, in forward_once
    x = module(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/activation.py", line 394, in forward
    return F.silu(input, inplace=self.inplace)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 1741, in silu
    return torch._C._nn.silu(input)
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 14.76 GiB total capacity; 13.70 GiB already allocated; 67.75 MiB free; 13.76 GiB reserved in total by PyTorch)
  0% 0/14 [00:03<?, ?it/s]

Solution

  • VRAM usage has nothing to do with how many train/val examples there are; it is determined by the model, the image size, and the batch size. 1280x1280 is a massive image size - on a 16 GB GPU you will probably only be able to train at a batch size of 1 or 2.

    Either use a lower resolution or a smaller model, use a GPU with more VRAM, or decrease your batch size.

    Also try NVIDIA AMP (automatic mixed precision), which keeps many activations in half precision and reduces memory usage; see the sketch below.
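
    For reference, a minimal sketch of a mixed-precision training step with torch.cuda.amp (available in the torch 1.7 shown in the log). The names `model`, `dataloader`, `optimizer`, and `compute_loss` are hypothetical stand-ins for illustration, not YOLOR's actual training objects.

      import torch
      from torch.cuda.amp import autocast, GradScaler

      scaler = GradScaler()  # dynamically scales the loss to avoid float16 underflow

      for imgs, targets in dataloader:          # `dataloader`, `model`, etc. are placeholders
          imgs = imgs.to('cuda', non_blocking=True).float() / 255.0
          optimizer.zero_grad()
          with autocast():                      # run the forward pass in float16 where safe
              pred = model(imgs)
              loss = compute_loss(pred, targets)
          scaler.scale(loss).backward()         # backward pass on the scaled loss
          scaler.step(optimizer)                # unscales gradients, then steps the optimizer
          scaler.update()                       # adjust the scale factor for the next iteration

    Running activations in half precision can free enough memory to allow a somewhat larger batch size at the same image resolution, though reducing the image size or batch size remains the more reliable fix.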