
TF2 Object Detection API: model_main_tf2.py - validation loss?


I have been trying to train an object detection model for the past two months and have finally succeeded by following this tutorial.

Here is my Colab notebook, which contains all my work.

The problem is, the training loss is shown, and it is decreasing on average, but the validation loss is not.

In the pipeline.config file, I did specify the evaluation TFRecord file (which I assumed to be the validation data input), like this:

eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}

eval_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}

and I read through model_main_tf2.py, which does not seem to run evaluation during training; it only evaluates when the checkpoint_dir argument is provided.

Hence, I have only been able to monitor the loss on the training set and not the loss on the validation set.

As a result, I have no way to tell whether the model is overfitting or underfitting.

Have any of you managed to use model_main_tf2.py successfully to view validation loss?

Also, it would be nice to see the mAP score during training.

I know Keras training lets you monitor all of this in TensorBoard, but the OD API seems to be much harder to work with.

Thank you for your time; if anything is unclear, please let me know.


Solution

  • You have to open another terminal and run this command:

    python model_main_tf2.py \
       --model_dir=models/my_ssd_resnet50_v1_fpn \
       --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config \
       --checkpoint_dir=models/my_ssd_resnet50_v1_fpn
    

    The API tutorial is unclear on this topic; I had the exact same issue.

    It turns out that the evaluation process is not included in the training loop; you must launch it in parallel.

    The evaluation process will wait and print "waiting for new checkpoint", which means you can then launch training with:

    python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config # note that the checkpoint_dir argument is not there
    

    The evaluation will then run once every eval_interval_secs, as set in your eval_config.
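    For reference, here is a minimal sketch of what that might look like in the eval_config block; the value of 300 seconds is an assumed example, not taken from the original config:

    eval_config {
      metrics_set: "coco_detection_metrics"
      use_moving_averages: false
      eval_interval_secs: 300  # run evaluation at most once every 5 minutes
    }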

    According to the documentation, the eval metrics will then be stored next to your checkpoints inside an eval_0 directory, which you can then plot in TensorBoard.
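    To view those metrics, you can point TensorBoard at the model directory (the path below matches the commands above):

    ```shell
    # TensorBoard picks up both the training and eval_0 event files under model_dir
    tensorboard --logdir=models/my_ssd_resnet50_v1_fpn
    ```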

    I agree that this was a bit hard to understand, as it is not very clear in the documentation. It is also not very convenient, since I had to allocate another GPU for evaluation to avoid a CUDA out-of-memory error.
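    One way to do that split, assuming a machine with two GPUs (the device indices here are illustrative), is to pin each process to its own device with CUDA_VISIBLE_DEVICES:

    ```shell
    # Terminal 1: training on GPU 0
    CUDA_VISIBLE_DEVICES=0 python model_main_tf2.py \
        --model_dir=models/my_ssd_resnet50_v1_fpn \
        --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config

    # Terminal 2: evaluation on GPU 1
    CUDA_VISIBLE_DEVICES=1 python model_main_tf2.py \
        --model_dir=models/my_ssd_resnet50_v1_fpn \
        --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config \
        --checkpoint_dir=models/my_ssd_resnet50_v1_fpn
    ```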

    Have a nice day