How to train a tensorflow Object Detection using the

I'm trying to use the new "" instead of the legacy and but I'm having trouble running it on tensorflow-gpu with my graphics card (compute capability 6.1). When I run it, training fails with the error "NaN loss during training". I'm not sure what the problem is, since the same setup runs without any issue on the CPU version of TensorFlow.

The command I'm using:

python --pipeline_config_path=train/ssd_mobilenet_v2_coco.config --model_dir=/train --num_train_steps=80000 --num_eval_steps=10 --alsologtostderr

My tensorflow-gpu version is 1.9, with CUDA 9.0 and cuDNN 7.0. Thanks.
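
For anyone comparing setups, here is a minimal sketch of how the version details above could be collected programmatically. The helper name `describe_environment` is made up for this illustration. It only reports what TensorFlow itself exposes; the CUDA and cuDNN versions are not covered and still have to be noted by hand.

```python
import platform
import sys


def describe_environment():
    """Collect version details worth quoting in a bug report."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is missing from this interpreter; report that and stop.
        info["tensorflow"] = "not installed"
        return info
    info["tensorflow"] = tf.__version__
    # True for the GPU build of TensorFlow, False for the CPU-only build.
    info["built_with_cuda"] = tf.test.is_built_with_cuda()
    return info


for key, value in sorted(describe_environment().items()):
    print("{}: {}".format(key, value))
```

Running this in both the working and the failing environment makes it easier to spot which component differs.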

EDIT: The full error message:

E:\models-master\research>python object_detection\ --pipeline_config_path=object_detection\train\ssd_mobilenet_v2_coco.config --model_dir=object_detection\train --num_train_steps=2000 --num_eval_steps=10 --alsologtostderr
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x000001A76C31D598>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From E:\models-master\research\object_detection\core\ calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
WARNING:tensorflow:From E:\models-master\research\object_detection\builders\ batch_and_drop_remainder (from is deprecated and will be removed in a future version.
Instructions for updating:
Use `, drop_remainder=True)`.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_2_3x3_s2_512/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 512]], model variable shape: [[3, 3, 256, 512]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_3_3x3_s2_256/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 256]], model variable shape: [[3, 3, 128, 256]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_4_3x3_s2_256/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 256]], model variable shape: [[3, 3, 128, 256]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 64, 128]], model variable shape: [[3, 3, 64, 128]]. This variable will not be initialized from the checkpoint.
2019-05-04 08:06:11.421021: I T:\src\github\tensorflow\tensorflow\core\platform\] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-05-04 08:06:11.843393: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.455
pciBusID: 0000:26:00.0
totalMemory: 4.00GiB freeMemory: 3.29GiB
2019-05-04 08:06:11.848572: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\] Adding visible gpu devices: 0
2019-05-04 08:06:13.549559: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-04 08:06:13.552760: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\]      0
2019-05-04 08:06:13.554766: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\] 0:   N
2019-05-04 08:06:13.556851: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3015 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:26:00.0, compute capability: 6.1)
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "object_detection\", line 109, in <module>
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\", line 125, in run
  File "object_detection\", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\", line 451, in train_and_evaluate
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\", line 590, in run
    return self.run_local()
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\", line 691, in run_local
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\", line 1145, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\", line 1173, in _train_model_default
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\", line 1451, in _train_with_estimator_spec
    _, loss =[estimator_spec.train_op, estimator_spec.loss])
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\", line 583, in run
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\", line 1059, in run
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\", line 1150, in run
    raise six.reraise(*original_exc_info)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\", line 693, in reraise
    raise value
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\", line 1135, in run
    return*args, **kwargs)
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\", line 1215, in run
  File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\", line 635, in after_run
    raise NanLossDuringTrainingError
NaN loss during training.



  • It looks like the problem is with tensorflow-gpu on Windows. The issue was resolved when I switched to Ubuntu with the latest tensorflow-gpu (1.13) installed.

    Note that I also tried tensorflow-gpu 1.13 on Windows, but that required some code editing because some of the commands are no longer recognized.
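
Since the question reports that the CPU version trains without the NaN error, one way to check whether the failure is specific to the GPU path, without installing a second TensorFlow build, is to hide the GPU from the process and rerun the same command. The sketch below assumes this approach is acceptable for your setup. `TRAIN_SCRIPT` is a placeholder, because the script name is not shown in the post, and the helper names are made up for this illustration. The flags are the ones from the question.

```python
import os
import subprocess
import sys

# Placeholder: the training script's name is not shown in the post.
TRAIN_SCRIPT = "path/to/training_script.py"


def cpu_only_env(base_env=None):
    """Return a copy of the environment with every CUDA device hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # "-1" matches no device, so TensorFlow falls back to the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def build_command(script, pipeline_config, model_dir, train_steps, eval_steps):
    """Assemble the command line using the flags from the question."""
    return [
        sys.executable, script,
        "--pipeline_config_path={}".format(pipeline_config),
        "--model_dir={}".format(model_dir),
        "--num_train_steps={}".format(train_steps),
        "--num_eval_steps={}".format(eval_steps),
        "--alsologtostderr",
    ]


def run_cpu_only(command):
    """Run the command with GPUs hidden and return its exit code."""
    return subprocess.call(command, env=cpu_only_env())
```

For example, `run_cpu_only(build_command(TRAIN_SCRIPT, "train/ssd_mobilenet_v2_coco.config", "train", 2000, 10))` would start a short CPU-only run. If that run trains normally while the GPU run still diverges, the problem is more likely in the GPU stack than in the config or the data.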