Tags: python, machine-learning, nan, yolo, darknet

Getting -NaN during Darknet training, what am I doing wrong?


I want to train YOLOv3 to detect humans in aerial images. I'm using the VisDrone Object Detection in Images dataset: github.com/VisDrone/VisDrone-Dataset

I wrote a script that converts the labels to darknet format so that I can train according to pjreddie's "Training YOLO on COCO" instructions. I double-checked that my converted labels match the objects correctly, and they do. I also created a coco.names file according to the label descriptions in VisDrone2018-DET-toolkit on GitHub. I created the trainvalno5k.txt file by running

python 5kGenerator.py > trainvalno5k.txt

5kGenerator.py:

import os

# Print the absolute path of every file in the 'images' directory, one per line.
for filename in os.listdir('images'):
    print(os.path.abspath(os.path.join('images', filename)))
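The label conversion script mentioned earlier is not shown in the question. As a rough illustration only, the sketch below converts one annotation line to one darknet label line. It assumes the VisDrone DET annotation layout `bbox_left,bbox_top,bbox_width,bbox_height,score,object_category,truncation,occlusion` and the darknet layout `class x_center y_center width height` with coordinates normalized by the image size. The function name `visdrone_to_darknet` is hypothetical, and keeping the VisDrone category id unchanged as the class id is an assumption, not necessarily what the asker did.

```python
def visdrone_to_darknet(line, img_w, img_h):
    """Convert one VisDrone annotation line to a darknet label line.

    Assumed input layout (comma separated):
        bbox_left, bbox_top, bbox_width, bbox_height,
        score, object_category, truncation, occlusion
    Output layout (space separated, normalized to [0, 1]):
        class x_center y_center width height
    """
    # Tolerate a trailing comma at the end of the line.
    fields = line.strip().rstrip(',').split(',')
    left, top, width, height = (float(v) for v in fields[:4])
    # Assumption: the VisDrone category id is used directly as the class id.
    category = int(fields[5])

    # Darknet expects the box centre, not the top-left corner.
    x_center = (left + width / 2.0) / img_w
    y_center = (top + height / 2.0) / img_h
    return "%d %.6f %.6f %.6f %.6f" % (
        category, x_center, y_center, width / img_w, height / img_h)


if __name__ == "__main__":
    # A made-up annotation line for a 1000x500 image.
    print(visdrone_to_darknet("100,200,50,100,1,4,0,0", 1000, 500))
```

The image size is passed in explicitly so the sketch needs no image library; a real script would read it from each image file.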

I modified the coco.data file; this is the result:

classes= 12
train  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/trainvalno5k.txt
#valid  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/5k.txt
#valid = data/coco_val_5k.list
names  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/coco.names
backup = backup
#eval=coco

I commented valid out because, as far as I understand, it is only used for checking results, and a validation set is irrelevant for training (I didn't bother to create one).
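A quick sanity check on a .data file like the one above can be sketched as follows. This is not darknet's own parser. It assumes only that the file consists of `key = value` lines and that lines starting with `#` are ignored, and it checks that the number of non-empty lines in the names file equals `classes`. The helper names `parse_data_cfg` and `names_match_classes` are hypothetical.

```python
def parse_data_cfg(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        options[key.strip()] = value.strip()
    return options


def names_match_classes(options, names_text):
    """Return True if the names file has exactly `classes` non-empty lines."""
    names = [n for n in names_text.splitlines() if n.strip()]
    return int(options['classes']) == len(names)


if __name__ == "__main__":
    cfg = """classes= 12
train  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/trainvalno5k.txt
#valid  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/5k.txt
names  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/coco.names
backup = backup
#eval=coco
"""
    options = parse_data_cfg(cfg)
    # The commented-out entries do not appear among the parsed options.
    print(sorted(options))
```

In practice the names text would be read from the path stored under `names`; it is passed in as a string here to keep the sketch free of file access.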

When I run ./darknet detector train cfg/coco.data cfg/yolov3.cfg darknet53.conv.74, everything loads correctly and training starts, but every few lines I get -nan messages. I have no idea why, or whether they affect the end result. Example:

Loading weights from darknet53.conv.74...Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
416
Loaded: 1.122782 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.428162, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.409795, Class: 0.690346, Obj: 0.091164, No Obj: 0.519810, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: 0.157575, Class: 0.532119, Obj: 0.333807, No Obj: 0.417611, .5R: 0.045685, .75R: 0.000000,  count: 197
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.427261, .5R: -nan, .75R: -nan,  count: 0

It's pretty slow because I'm testing this on a CPU; the proper training will be done on an Nvidia Quadro.

Can you please explain this behaviour, and what can I do to fix the -nan problem?

P.S. I'm using the Ubuntu terminal on Windows 10; I don't know if that's important.


Solution

    1. It is better to use the AlexeyAB darknet repository for training.

    2. You should use a validation or test set to evaluate the trained network on your data.

    3. I trained on a 26-class dataset without using the 5k file; your dataset has 12 classes.

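    4. For the NaN values, it is better to lower the learning rate at the start of training and then increase it. Note also that in your log -nan appears only on lines with count: 0, which suggests that no ground-truth objects were assigned to that detection layer in that batch, so the averages are taken over zero samples.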

    5. You can train your network on Windows or Linux; it does not matter.
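Point 4 describes a learning-rate warm-up. A minimal sketch of one such schedule follows, assuming a polynomial ramp of the kind darknet's `burn_in` option is understood to apply. The base rate of 0.001 is taken from the log in the question; `burn_in=1000` and `power=4.0` are illustrative assumptions, and the function name `warmup_learning_rate` is hypothetical.

```python
def warmup_learning_rate(batch_num, base_lr=0.001, burn_in=1000, power=4.0):
    """Return the learning rate for a given batch number.

    During the first `burn_in` batches the rate ramps up from 0 to
    `base_lr` following (batch_num / burn_in) ** power; afterwards it
    stays at `base_lr`. Later decay steps are deliberately left out.
    """
    if batch_num < burn_in:
        return base_lr * (batch_num / burn_in) ** power
    return base_lr


if __name__ == "__main__":
    # Print the rate at a few points along the ramp.
    for batch in (0, 250, 500, 750, 1000, 2000):
        print(batch, warmup_learning_rate(batch))
```

With a high power the rate stays very small for most of the warm-up period and rises steeply near the end, which keeps the early updates small while the freshly initialized detection layers settle.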