
YOLOv4 reports 30-hour training time on Colab Pro with only 340 training images


I am testing my model on Colab Pro, using only 340 training images with 16 classes. However, the training output reports about 30 hours of training time left:

(next mAP calculation at 1200 iterations) 
 Last accuracy mAP@0.50 = 0.37 %, best = 0.37 % 
 1187: 3.270728, 3.027621 avg loss, 0.010000 rate, 1.429193 seconds, 75968 images, 30.824708 hours left
Loaded: 1.136631 seconds - performance bottleneck on CPU or Disk HDD/SSD
...
...
...
 (next mAP calculation at 1300 iterations) 
 Last accuracy mAP@0.50 = 0.33 %, best = 0.37 % 
 1278: 3.231166, 2.967602 avg loss, 0.010000 rate, 2.552415 seconds, 81792 images, 30.512658 hours left
Loaded: 0.712928 seconds - performance bottleneck on CPU or Disk HDD/SSD

I don't know why it is estimating this much training time, since I only have a small dataset.

Here are my cfg parameters:

[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=16
width=1024
height=1024
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
 
learning_rate=0.01
burn_in=1000
max_batches = {max_batches}
policy=steps
steps={steps_str}
scales=.1,.1
 
[convolutional]
batch_normalize=1
filters=32
size=3
stride=2
pad=1
activation=leaky
 
[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky
 
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
 
[route]
layers=-1
groups=2
group_id=1
 
[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky
 
[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky
 
[route]
layers = -1,-2
 
[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky
 
[route]
layers = -6,-1
 
[maxpool]
size=2
stride=2
 
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
 
[route]
layers=-1
groups=2
group_id=1
 
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
 
[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky
 
[route]
layers = -1,-2
 
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
 
[route]
layers = -6,-1
 
[maxpool]
size=2
stride=2
 
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
 
[route]
layers=-1
groups=2
group_id=1
 
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
 
[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky
 
[route]
layers = -1,-2
 
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
 
[route]
layers = -6,-1
 
[maxpool]
size=2
stride=2
 
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
 
##################################
 
[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky
 
[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky
 
[convolutional]
size=1
stride=1
pad=1
filters={num_filters}
activation=linear
 
 
 
[yolo]
mask = 3,4,5
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes={num_classes}
num=6
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
truth_thresh = 1
random=1
nms_kind=greedynms
beta_nms=0.6
ignore_thresh = .9 
iou_normalizer=0.5 
iou_loss=giou
 
[route]
layers = -4
 
[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky
 
[upsample]
stride=2
 
[route]
layers = -1, 23
 
[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
 
[convolutional]
size=1
stride=1
pad=1
filters={num_filters}
activation=linear
 
[yolo]
mask = 1,2,3
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes={num_classes}
num=6
jitter=.3
scale_x_y = 1.05
cls_normalizer=1.0
ignore_thresh = .9 
iou_normalizer=0.5
iou_loss=giou
truth_thresh = 1
random=1
nms_kind=greedynms
beta_nms=0.6

Solution

  • Your training time is driven by the max_batches parameter, i.e. the total number of training iterations (batches) darknet will run, not by the size of your dataset.

    Based on the darknet repository's recommendation, max_batches should be set to classes*2000, which in your case is 16*2000 = 32,000 iterations. Darknet simply keeps cycling over your 340 images until it reaches that count, so at a few seconds per iteration (your log shows roughly 1.4-2.6 s of compute plus 0.7-1.1 s of data loading) the ~30-hour estimate is expected even with a small dataset. A rough sketch of the arithmetic follows below.
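
    As a rough, illustrative sketch (not darknet's exact internal estimate), the Python below fills in the templated cfg values for this setup following the darknet README recommendations (max_batches = classes*2000 but not less than 6000 or the number of training images; steps at 80% and 90% of max_batches; filters = (classes + 5) * 3 for the 3 masks per [yolo] layer) and then does a back-of-the-envelope remaining-time estimate from the per-iteration timings in your log. The variable names mirror the placeholders in your cfg; the 3.5 s/iteration figure is an assumed average of compute plus data-loading time, not a measured value.

    num_classes = 16
    num_train_images = 340

    # darknet README: max_batches = classes*2000, but not less than 6000
    # and not less than the number of training images
    max_batches = max(num_classes * 2000, 6000, num_train_images)      # 32000

    # steps = 80% and 90% of max_batches
    steps_str = f"{int(0.8 * max_batches)},{int(0.9 * max_batches)}"   # "25600,28800"

    # filters in the conv layer before each [yolo] layer:
    # (classes + 5) * masks per layer (3 here)
    num_filters = (num_classes + 5) * 3                                # 63

    # back-of-the-envelope remaining-time estimate (assumed ~3.5 s total
    # per iteration: ~1.4-2.6 s compute + ~0.7-1.1 s data loading)
    seconds_per_iteration = 3.5
    current_iteration = 1187
    hours_left = (max_batches - current_iteration) * seconds_per_iteration / 3600
    print(max_batches, steps_str, num_filters, round(hours_left, 1))   # roughly 30 hours left

    The point is that the estimate scales with max_batches (i.e. with the number of classes under the classes*2000 rule), not with the number of images, so a much larger dataset would show roughly the same "hours left". If you only want a quick smoke test, you can lower max_batches (and steps accordingly) at the cost of final accuracy.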