
PyTorch Training exiting after Caching Images


I have a dataset of around 12k training images and 500 validation images. I am using YOLOv5 (PyTorch) to train my model. When I start the training and it reaches the "Caching Images" stage, the process suddenly quits.

The code I'm using to run this is as follows:

!python train.py --img 800 --batch 32 --epochs 20 --data '/content/data.yaml' --cfg ./models/custom_yolov5s.yaml --weights yolov5s.pt --name yolov5s_results  --cache

I am using Google Colab to train my model.

This is the last line of output before it shuts down:

train: Caching Images (12.3GB ram): 99% 11880/12000 [00:47<00:00, 94.08it/s]


Solution

  • I solved the problem. It occurs because the --cache argument loads all the images into RAM beforehand to speed up the epochs. This makes training faster, but it also consumes a lot of memory. My Google Colab session provided 12.69GB of RAM. Caching the training images used up almost all of it (12.3GB at 99%), so there was nothing left to cache the validation set, and the process shut down immediately. There are two basic ways to solve this issue:

    Method 1:

    I reduced the image size from 800 to 640. My training images didn't contain any small objects, so I did not actually need large images. This cut my RAM consumption by almost half, from 12.3GB to 6.6GB.

    --img 640
    

    train: Caching Images (6.6GB ram): 100% 12000/12000 [00:30<00:00, 254.08it/s]

    Method 2:

    I had added this argument at the end of the command I use to run training:

    --cache
    

    This argument caches the entire dataset in RAM up front so it can be reused instantly instead of being loaded and processed again every epoch. If you are willing to compromise on training speed, this method will work for you: simply remove the argument and you will be good to go. Your new command will be:

    !python train.py --img 800 --batch 32 --epochs 20 --data '/content/data.yaml' --cfg ./models/custom_yolov5s.yaml --weights yolov5s.pt --name yolov5s_results
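
If you want to check up front whether --cache is likely to fit, here is a minimal sketch of that reasoning. It is not part of YOLOv5: the function names, the 20% headroom, and the assumption that cached images are held as uncompressed 8-bit RGB with the long side equal to --img are all mine. The estimate is only a rough upper bound, so trust the figure train.py prints ("Caching Images (… ram)") over it whenever you have one.

```python
"""Sketch (hypothetical helpers): decide whether --cache is likely to fit in RAM."""


def estimate_cache_gb(num_images: int, img_size: int, aspect_ratio: float = 1.0) -> float:
    """Rough RAM estimate in GB for caching images as uncompressed 8-bit RGB.

    Assumption: each cached image has its long side equal to img_size and its
    short side equal to img_size * aspect_ratio, with aspect_ratio in (0, 1].
    """
    if not 0 < aspect_ratio <= 1:
        raise ValueError("aspect_ratio must be in (0, 1]")
    short_side = round(img_size * aspect_ratio)
    bytes_per_image = img_size * short_side * 3  # 3 channels, 1 byte each
    return num_images * bytes_per_image / 1e9


def cache_fits(cache_gb: float, total_ram_gb: float, headroom: float = 0.2) -> bool:
    """True if the cache fits while leaving `headroom` (a fraction) of RAM free.

    The 20% default is an arbitrary safety margin, not a YOLOv5 or Colab value.
    """
    return cache_gb <= total_ram_gb * (1 - headroom)


def train_command(img_size: int, use_cache: bool) -> str:
    """Rebuild the training command from the question, with or without --cache."""
    cmd = (
        f"python train.py --img {img_size} --batch 32 --epochs 20 "
        "--data '/content/data.yaml' --cfg ./models/custom_yolov5s.yaml "
        "--weights yolov5s.pt --name yolov5s_results"
    )
    return cmd + " --cache" if use_cache else cmd


if __name__ == "__main__":
    COLAB_RAM_GB = 12.69  # RAM figure reported in the answer above
    # (image size, cache size logged by train.py) pairs taken from the answer
    for img_size, logged_gb in [(800, 12.3), (640, 6.6)]:
        use_cache = cache_fits(logged_gb, COLAB_RAM_GB)
        print(train_command(img_size, use_cache))
```

With the numbers from this thread, the check drops --cache at --img 800 (12.3GB leaves no headroom in 12.69GB) and keeps it at --img 640 (6.6GB fits comfortably), which matches the two methods above. Use estimate_cache_gb only when you have no logged figure yet. It assumes square images by default, so it will overestimate for datasets with wide or tall images.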