Search code examples
tensorflowobject-detection-api

How to train a tensorflow Object Detection using the model-main.py?


I'm trying to use the new "model_main.py" instead of the legacy train.py and eval.py but I'm having issues running them on a tensorflow-gpu with my graphics card (my compute capability is 6.1). Once I run this, it throws an error:
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training. Not sure what the problem is, yet it runs without any issue on the CPU version.

The command I'm using:

python model_main.py --pipeline_config_path=train/ssd_mobilenet_v2_coco.config --model_dir=/train --num_train_steps=80000 --num_eval_steps=10 --alsologtostderr

my tensorflow-gpu version is 1.9 .. CUDA 9.0 and cuDNN 7.0 Thanks

EDIT: The full error message>

E:\models-master\research>python object_detection\model_main.py --pipeline_config_path=object_detection\train\ssd_mobilenet_v2_coco.config --model_dir=object_detection\train --num_train_steps=2000 --num_eval_steps=10 --alsologtostderr
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x000001A76C31D598>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From E:\models-master\research\object_detection\core\preprocessor.py:1240: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
WARNING:tensorflow:From E:\models-master\research\object_detection\builders\dataset_builder.py:152: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_2_3x3_s2_512/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 512]], model variable shape: [[3, 3, 256, 512]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_3_3x3_s2_256/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 256]], model variable shape: [[3, 3, 128, 256]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_4_3x3_s2_256/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 256]], model variable shape: [[3, 3, 128, 256]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 64, 128]], model variable shape: [[3, 3, 64, 128]]. This variable will not be initialized from the checkpoint.
2019-05-04 08:06:11.421021: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-05-04 08:06:11.843393: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.455
pciBusID: 0000:26:00.0
totalMemory: 4.00GiB freeMemory: 3.29GiB
2019-05-04 08:06:11.848572: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2019-05-04 08:06:13.549559: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-04 08:06:13.552760: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971]      0
2019-05-04 08:06:13.554766: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0:   N
2019-05-04 08:06:13.556851: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3015 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:26:00.0, compute capability: 6.1)
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "object_detection\model_main.py", line 109, in <module>
    tf.app.run()
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "object_detection\model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\training.py", line 451, in train_and_evaluate
    return executor.run()
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\training.py", line 590, in run
    return self.run_local()
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\training.py", line 691, in run_local
    saving_listeners=saving_listeners)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1145, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1173, in _train_model_default
    saving_listeners)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1451, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 583, in run
    run_metadata=run_metadata)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1059, in run
    run_metadata=run_metadata)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1150, in run
    raise six.reraise(*original_exc_info)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1135, in run
    return self._sess.run(*args, **kwargs)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1215, in run
    run_metadata=run_metadata))
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 635, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

E:\models-master\research>

Solution

  • Looks like the problem is with the tensorflow-gpu under windows environment. The issue is resolved when I switched to Ubuntu with the latest tensorflow-gpu installed (1.13).

    It is noteworthy that I have tried using tensorflow-gpu 1.13 on windows but the model_main.py required some code editing as some of the commands are not recognizable anymore.