Cannot restore from checkpoint: bidirectional/backward_lstm/bias


I am trying to create a simple LSTM-based RNN in tensor2tensor.

Training seems to work so far, but I cannot restore the model. Trying to do so throws a NotFoundError that points to a bias node of the LSTM:

NotFoundError: .. 

Key bidirectional/backward_lstm/bias not found in checkpoint

and I don't know why this is the case.

This was actually supposed to be a workaround for another problem, where I ran into a similar issue using an LSTM from tensor2tensor (https://github.com/tensorflow/tensor2tensor/issues/1616).

Environment

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.12.0
tensorboard==1.12.0
tensorflow-datasets==1.0.2
tensorflow-estimator==1.13.0
tensorflow-gpu==1.12.0
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0

Model body

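# Imports assumed; the original post does not show them.
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Activation, Bidirectional, concatenate, dot

def body(self, features):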

    inputs = features['inputs'][:,:,0,:]

    hparams = self._hparams
    problem = hparams.problem
    encoders = problem.feature_info

    max_input_length = 350
    max_output_length = 350 

    encoder = Bidirectional(LSTM(128, return_sequences=True, unroll=False), merge_mode='concat')(inputs)
    encoder_last = encoder[:, -1, :]

    decoder = LSTM(256, return_sequences=True, unroll=False)(inputs, initial_state=[encoder_last, encoder_last])

    attention = dot([decoder, encoder], axes=[2, 2])
    attention = Activation('softmax', name='attention')(attention)

    context = dot([attention, encoder], axes=[2, 1])
    concat = concatenate([context, decoder])

    return tf.expand_dims(concat, 2)

Full error

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key while/lstm_keras/parallel_0_4/lstm_keras/lstm_keras/body/bidirectional/backward_lstm/bias not found in checkpoint
     [[node save/RestoreV2 (defined at /home/sfalk/tmp/pycharm_project_265/asr/model/persistence.py:282)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Any idea what the issue might be and how to fix it?


Solution

  • This seems to be related to https://github.com/tensorflow/tensor2tensor/issues/1486. When tensor2tensor restores from a checkpoint, "while" appears to be prepended to the key names. This looks like an unaddressed bug; your input on GitHub would be appreciated.

    I would comment this if I could, but my reputation is too low. Cheers.
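
One way to check whether the failure really is a scope-prefix mismatch is to compare the key from the error message with the keys actually stored in the checkpoint. The following is a minimal sketch, not a fix. The helper names `find_prefix_mismatches` and `list_checkpoint_keys` are made up for illustration, and the `stored` list is a hypothetical stand-in for the contents of a real checkpoint. Only `tf.train.list_variables` is an existing TensorFlow call.

```python
def find_prefix_mismatches(missing_key, checkpoint_keys):
    """Return checkpoint keys that equal `missing_key` once leading scopes are dropped.

    Example: if the graph asks for 'while/a/b/bias' and the checkpoint
    stores 'a/b/bias', then 'a/b/bias' is returned.
    """
    parts = missing_key.split("/")
    # Every proper suffix of the missing key, i.e. the key with one or more
    # leading name scopes removed.
    suffixes = {"/".join(parts[i:]) for i in range(1, len(parts))}
    return sorted(key for key in checkpoint_keys if key in suffixes)


def list_checkpoint_keys(checkpoint_dir):
    """Read the variable names stored in a checkpoint (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily so the helper above runs without TF

    # tf.train.list_variables yields (name, shape) pairs.
    return [name for name, _shape in tf.train.list_variables(checkpoint_dir)]


if __name__ == "__main__":
    # Key copied from the error message in the question.
    wanted = ("while/lstm_keras/parallel_0_4/lstm_keras/lstm_keras/body/"
              "bidirectional/backward_lstm/bias")
    # Hypothetical checkpoint contents; in practice use
    # list_checkpoint_keys("<your model dir>") instead.
    stored = [
        "lstm_keras/body/bidirectional/backward_lstm/bias",
        "lstm_keras/body/bidirectional/backward_lstm/kernel",
    ]
    print(find_prefix_mismatches(wanted, stored))
```

If this prints a non-empty list, the variable exists in the checkpoint under a shorter name, which would be consistent with extra scopes such as "while" being added at restore time. The sketch only covers the case where the graph key is longer than the stored key. An empty result means the mismatch has some other cause.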