I'm trying to use the new "model_main.py" instead of the legacy train.py and eval.py but I'm having issues running them on a tensorflow-gpu with my graphics card (my compute capability is 6.1). Once I run this, it throws an error:
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
Not sure what the problem is, yet it runs without any issue on the CPU version.
The command I'm using:
python model_main.py --pipeline_config_path=train/ssd_mobilenet_v2_coco.config --model_dir=/train --num_train_steps=80000 --num_eval_steps=10 --alsologtostderr
my tensorflow-gpu version is 1.9 .. CUDA 9.0 and cuDNN 7.0 Thanks
EDIT: The full error message>
E:\models-master\research>python object_detection\model_main.py --pipeline_config_path=object_detection\train\ssd_mobilenet_v2_coco.config --model_dir=object_detection\train --num_train_steps=2000 --num_eval_steps=10 --alsologtostderr
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x000001A76C31D598>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From E:\models-master\research\object_detection\core\preprocessor.py:1240: calling squeeze (from tensorflow.python.ops.array_ops) with squeeze_dims is deprecated and will be removed in a future version.
Instructions for updating:
Use the `axis` argument instead
WARNING:tensorflow:From E:\models-master\research\object_detection\builders\dataset_builder.py:152: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_2_3x3_s2_512/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 512]], model variable shape: [[3, 3, 256, 512]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_3_3x3_s2_256/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 256]], model variable shape: [[3, 3, 128, 256]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_4_3x3_s2_256/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 256]], model variable shape: [[3, 3, 128, 256]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 64, 128]], model variable shape: [[3, 3, 64, 128]]. This variable will not be initialized from the checkpoint.
2019-05-04 08:06:11.421021: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-05-04 08:06:11.843393: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.455
pciBusID: 0000:26:00.0
totalMemory: 4.00GiB freeMemory: 3.29GiB
2019-05-04 08:06:11.848572: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2019-05-04 08:06:13.549559: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-04 08:06:13.552760: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0
2019-05-04 08:06:13.554766: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N
2019-05-04 08:06:13.556851: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3015 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:26:00.0, compute capability: 6.1)
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
File "object_detection\model_main.py", line 109, in <module>
tf.app.run()
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "object_detection\model_main.py", line 105, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\training.py", line 451, in train_and_evaluate
return executor.run()
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\training.py", line 590, in run
return self.run_local()
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\training.py", line 691, in run_local
saving_listeners=saving_listeners)
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 376, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1145, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1173, in _train_model_default
saving_listeners)
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1451, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 583, in run
run_metadata=run_metadata)
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1059, in run
run_metadata=run_metadata)
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1150, in run
raise six.reraise(*original_exc_info)
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1135, in run
return self._sess.run(*args, **kwargs)
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1215, in run
run_metadata=run_metadata))
File "C:\Users\TheWiz\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 635, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
E:\models-master\research>
Looks like the problem is with the tensorflow-gpu under windows environment. The issue is resolved when I switched to Ubuntu with the latest tensorflow-gpu installed (1.13).
It is noteworthy that I have tried using tensorflow-gpu 1.13 on windows but the model_main.py required some code editing as some of the commands are not recognizable anymore.