tensorflow dataset tensorboard tensorflow-datasets

Invalid argument: Nan in summary histogram after changing the number of labels

I have decreased the default number of labels for the Cityscapes dataset from 19 to 10. My goal is to change the dataset so that the decoder needs to relearn its weights, as a preparatory exercise for increasing the number of output classes of the decoder.

The network I am using is DeepLab. The training process is fine at first: about 500 steps ran before the error occurred.

(The log below does not start from the first line after the start of training.)

I1111 16:19:23.461441 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.82067
Total loss is :[6.42209053]
INFO:tensorflow:global_step/sec: 1.84064
I1111 16:19:28.894436 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84064
Total loss is :[6.23576546]
INFO:tensorflow:global_step/sec: 1.84368
I1111 16:19:34.318257 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84368
Total loss is :[6.09628582]
INFO:tensorflow:global_step/sec: 1.83645
I1111 16:19:39.763585 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.83645
Total loss is :[6.20008707]
INFO:tensorflow:global_step/sec: 1.84192
I1111 16:19:45.192930 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84192
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
     [[{{node image_pooling/BatchNorm/moving_variance_1}}]]
     [[Mean_225/_10177]]
  (1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
     [[{{node image_pooling/BatchNorm/moving_variance_1}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 515, in main
    sess.run([train_tensor])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
    raise six.reraise(*original_exc_info)
  File "/home/zwang/.local/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
     [[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
     [[Mean_225/_10177]]
  (1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
     [[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
 image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)

Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
 image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)

Original stack trace for 'image_pooling/BatchNorm/moving_variance_1':
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
    tf.app.run()
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 472, in main
    dataset.ignore_label)
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 379, in _train_deeplab_model
    reuse_variable=(i != 0))
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 275, in _tower_loss
    _build_deeplab(iterator, {common.OUTPUT_TYPE: num_of_classes}, ignore_label)
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 257, in _build_deeplab
    output_type_dict[model.MERGED_LOGITS_SCOPE])
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 328, in _log_summaries
    tf.summary.histogram(model_var.op.name, model_var)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 179, in histogram
    tag=tag, values=values, name=scope)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 329, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

I think the error

  (0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1

seems to come from TensorBoard. Is there some way to avoid it?

My training ran for 500 of the 30000 steps without any problem, so I am hoping that the training process would run properly if I disabled some functionality (such as the TensorBoard histograms) or edited the number of labels somewhere else _(maybe there is another parameter for num_of_classes that needs editing)_.

Could you give some suggestions, either directly about this error or about my general approach? Thanks.

Best Regards

Zhe

Solution

The problem was solved by adjusting the training hyper-parameters, for example by decreasing the learning rate to stabilize the training process.
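
Note that the histogram summary only reports the NaN. According to the log, the value is read from the image_pooling/BatchNorm/moving_variance variable itself, which suggests the training diverged, so switching off the histograms would most likely only hide the problem. As a minimal sketch (not DeepLab's actual code), the snippet below computes a polynomial-decay learning rate, which to my knowledge is the kind of schedule DeepLab's train.py uses by default, and shows how a smaller base rate scales the rate at every step. It also includes a small check for non-finite values. The function names, the two base rates and the sample values are assumptions made for illustration; only the 30000 total steps come from the question.

```python
import math


def poly_learning_rate(step, base_learning_rate, total_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    # Clamp the step so the rate never becomes negative or complex.
    step = min(max(step, 0), total_steps)
    return base_learning_rate * (1.0 - step / total_steps) ** power


def first_non_finite(values):
    """Return the index of the first NaN/Inf in values, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None


if __name__ == "__main__":
    total_steps = 30000  # number of training steps mentioned in the question

    # Hypothetical base learning rates: a larger one and a 10x smaller one.
    for base_lr in (1e-3, 1e-4):
        rates = [poly_learning_rate(s, base_lr, total_steps) for s in (0, 500, 15000)]
        print(base_lr, rates)

    # Hypothetical values of a variable that has diverged.
    moving_variance = [0.8, 1.1, float("nan"), 0.9]
    print("first non-finite index:", first_non_finite(moving_variance))
```

In a real run, the equivalent of the second function would be a numeric check on the loss or on the variables, so that training stops at the step where the values first become NaN rather than at the next summary write.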