I followed the T2T Transformer "Train a language model" example and it worked for 10 training steps. However, when scaling up to 250,000 steps I get an OutOfRangeError (below). Is this a problem with parsing or something else?
INFO:tensorflow:Init TPU system
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
WARNING:tensorflow:
Error occurred during infeed/outfeed. This may be due to a compile error in the main session. Waiting for a short time for the main session to come back.
End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
Caused by op 'input_pipeline_task0/while/IteratorGetNext', defined at:
File "/usr/local/bin/t2t-trainer", line 32, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/usr/local/bin/t2t-trainer", line 28, in main
t2t_trainer.main(argv)
File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 359, in main
execute_schedule(exp)
...
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 729, in enqueue_ops_fn
features, labels = inputs.features_and_labels()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2766, in features_and_labels
return _Inputs._parse_inputs(self._iterator.get_next())
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1745, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
ERROR:tensorflow:Feed error: Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
During handling of the above exception, another exception occurred:
...
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to `Session::Close()`.
One of the authors of the Tensor2Tensor library here.
Short answer: reduce --eval_steps.
Long answer:
Unfortunately, TPUEstimator, the library we use under the hood to run on TPU, does not catch OutOfRangeError when you run out of input data. During training this is not a problem because the input data is infinite (we call repeat on the input tf.data.Dataset). During evaluation, however, you want to make a single pass over the data, which means you need to set --eval_steps correctly so that you don't exhaust the input data. Hopefully TPUEstimator will soon handle catching the error so that you don't have to work out how many eval steps to run.
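As a rough sketch of how you might pick a safe value for --eval_steps: count the examples in your eval set and divide by the eval batch size, rounding down. The helper name max_eval_steps and the example count below are hypothetical; the batch size of 64 is only read off the output_shapes in the log above, and the sketch assumes a partial final batch is not served.

```python
def max_eval_steps(num_eval_examples, batch_size):
    """Return how many full batches fit in one pass over the eval data.

    Assumption: a final partial batch is dropped, so only full batches count.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if num_eval_examples < 0:
        raise ValueError("num_eval_examples must be non-negative")
    return num_eval_examples // batch_size


# Hypothetical eval set size; replace with the real count for your data.
num_eval_examples = 10000
# Batch size taken from the output_shapes in the log (e.g. [64, 256]).
batch_size = 64

steps = max_eval_steps(num_eval_examples, batch_size)
print("Pass --eval_steps no larger than", steps)
```

Any value at or below the result should keep the eval iterator from running out of data; a larger value asks for more batches than one pass can supply.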