I have decreased the default number of labels of the Cityscapes dataset from 19 to 10. My goal is to change the dataset so that the decoder needs to relearn its weights, as a preparatory exercise for increasing the number of output classes of the decoder.
The network I am using is DeepLab. The training process is fine at first; about 500 steps ran before the error occurred.
(The log below does not start from the first line after the start of training.)
I1111 16:19:23.461441 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.82067
Total loss is :[6.42209053]
INFO:tensorflow:global_step/sec: 1.84064
I1111 16:19:28.894436 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84064
Total loss is :[6.23576546]
INFO:tensorflow:global_step/sec: 1.84368
I1111 16:19:34.318257 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84368
Total loss is :[6.09628582]
INFO:tensorflow:global_step/sec: 1.83645
I1111 16:19:39.763585 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.83645
Total loss is :[6.20008707]
INFO:tensorflow:global_step/sec: 1.84192
I1111 16:19:45.192930 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84192
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[{{node image_pooling/BatchNorm/moving_variance_1}}]]
[[Mean_225/_10177]]
(1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[{{node image_pooling/BatchNorm/moving_variance_1}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 515, in main
sess.run([train_tensor])
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
raise six.reraise(*original_exc_info)
File "/home/zwang/.local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
[[Mean_225/_10177]]
(1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
[[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)
Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)
Original stack trace for 'image_pooling/BatchNorm/moving_variance_1':
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
tf.app.run()
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 472, in main
dataset.ignore_label)
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 379, in _train_deeplab_model
reuse_variable=(i != 0))
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 275, in _tower_loss
_build_deeplab(iterator, {common.OUTPUT_TYPE: num_of_classes}, ignore_label)
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 257, in _build_deeplab
output_type_dict[model.MERGED_LOGITS_SCOPE])
File "home/zwang/workspace//models-master/research/deeplab/train.py", line 328, in _log_summaries
tf.summary.histogram(model_var.op.name, model_var)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 179, in histogram
tag=tag, values=values, name=scope)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 329, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
I think the error
(0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
seems to be a TensorBoard error. Is there some way to avoid it?
Since my training ran 500 of 30000 steps without any problem, I am hoping that by disabling some part of the code (such as the TensorBoard histogram summaries), or by editing the num_of_labels somewhere else _(maybe there is another parameter for num_of_classes that needs editing)_, the training process would run properly.
Could you give some suggestions, either directly about this error or about my general approach? Thanks
Best Regards
Zhe
The problem was solved by adjusting the training hyperparameters, for example by decreasing the learning rate to stabilize the training process.
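
For anyone who hits the same message: the histogram summary is only where the NaN gets reported, because the variable it logs already contains NaN by then. Below is a toy sketch of why lowering the learning rate can help. It is not DeepLab code; the function name, the quadratic loss, and all numeric values are made up for illustration. Plain gradient descent on f(w) = 0.5 * curvature * w**2 blows up to a non-finite value when the step size is too large, and settles near the minimum when it is small.

```python
import math


def run_gradient_descent(learning_rate, steps=2000, curvature=10.0, start=1.0):
    """Minimise f(w) = 0.5 * curvature * w**2 with plain gradient descent.

    Returns (final_w, step) where step is the index of the update at which
    w first became non-finite, or None if w stayed finite for all steps.
    This is a hypothetical toy model, not part of any training script.
    """
    w = start
    for step in range(steps):
        grad = curvature * w           # derivative of the quadratic loss
        w = w - learning_rate * grad   # standard gradient descent update
        if not math.isfinite(w):
            # Stop at the first inf/NaN, similar to a summary op failing
            # the first time it sees a bad value.
            return w, step
    return w, None


if __name__ == "__main__":
    # Each update multiplies w by (1 - learning_rate * curvature).
    # With 0.5 that factor is -4, so |w| keeps growing until it overflows.
    print(run_gradient_descent(learning_rate=0.5))
    # With 0.01 the factor is 0.9, so w shrinks towards the minimum at 0.
    print(run_gradient_descent(learning_rate=0.01))
```

In this toy case the run with the large step size looks fine for a while before the value overflows, which matches the pattern of a training run that works for some hundred steps and then fails. A real network is of course not a one-dimensional quadratic, so treat this only as intuition for the hyperparameter change described above.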