python, tensorflow, machine-learning, bert-language-model, tpu

InternalError when using TPU for training Keras model


I am attempting to fine-tune a BERT model from TensorFlow Hub on Google Colab, following this link.

However, when I call model.fit(...), I run into the following error:

InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2047) arg_shape.handle_type != DT_INVALID  input edge: [id=2693 model_preprocessing_67660:0 -> cluster_train_function:628]

This error only occurs when I try to use the TPU; the model runs fine on the CPU, but training takes very long.

Here is my code for setting up the TPU and model:

TPU Setup:

import os
import tensorflow as tf

# Load TF Hub models uncompressed (recommended when training on TPU)
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"

cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
strategy = tf.distribute.TPUStrategy(cluster_resolver)
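
As a quick sanity check (my addition, not part of the original setup), you can confirm the runtime actually exposes TPU devices before building the model:

print("TPU devices:", tf.config.list_logical_devices('TPU'))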

Model Setup:

import tensorflow_hub as hub

def build_classifier_model():
  # Raw string input; preprocessing happens inside the model
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer('https://tfhub.dev/google/experts/bert/wiki_books/sst2/2', trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)
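
Before moving to the TPU, a quick CPU smoke test of the builder can confirm the wiring (my addition; it downloads both Hub models on first run):

test_model = build_classifier_model()
test_model.summary()
print(test_model(tf.constant(['such a great movie!'])))  # raw logit, shape (1, 1)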

Model Training:

from official.nlp import optimization  # provides create_optimizer (AdamW)

with strategy.scope():

  bert_model = build_classifier_model()
  loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
  metrics = tf.metrics.BinaryAccuracy()
  epochs = 1
  steps_per_epoch = 1280000
  num_train_steps = steps_per_epoch * epochs
  num_warmup_steps = int(0.1 * num_train_steps)

  init_lr = 3e-5
  optimizer = optimization.create_optimizer(init_lr=init_lr,
                                            num_train_steps=num_train_steps,
                                            num_warmup_steps=num_warmup_steps,
                                            optimizer_type='adamw')
  bert_model.compile(optimizer=optimizer,
                     loss=loss,
                     metrics=metrics)
  print('Training model')
  history = bert_model.fit(x=X_train, y=y_train,
                           validation_data=(X_val, y_val),
                           epochs=epochs)

Note that X_train is a NumPy array of strings with shape (1280000,) and y_train is a NumPy array of shape (1280000, 1).
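
As an aside (my assumption, not something the original code does): TPUs require static tensor shapes, so rather than passing raw NumPy arrays to fit() it often helps to wrap them in a batched tf.data.Dataset with drop_remainder=True:

batch_size = 32  # illustrative value
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(10_000)
            .batch(batch_size, drop_remainder=True)  # fixed batch shape for the TPU
            .prefetch(tf.data.AUTOTUNE))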


Solution

  • I don't know exactly what changes you have made to the code, and I don't know your dataset. But I can see that you are trying to train the whole dataset in one epoch and are passing the steps per epoch directly. I would recommend writing it like this:

    Set batch_size to some power of 2 (for example 16, 32, etc.); if you don't want to batch the dataset, just set batch_size to 1:

    batch_size = 16
    steps_per_epoch = training_data_size // batch_size
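
    With those two variables in place, pass them to fit() explicitly (a sketch reusing the variables from the training code above):

    history = bert_model.fit(x=X_train, y=y_train,
                             validation_data=(X_val, y_val),
                             batch_size=batch_size,
                             steps_per_epoch=steps_per_epoch,
                             epochs=epochs)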
    

    The problem with the code is most probably steps_per_epoch: I think you're making a mistake by typing the training-dataset size in manually.

    If you're loading the dataset from tfds, use (as shown in the link):

    train_dataset, train_data_size = load_dataset_from_tfds(
      in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
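
    If you'd rather not use the tutorial's helper, a rough equivalent with plain tensorflow_datasets (my sketch, not the tutorial's exact code) is:

    import tensorflow_datasets as tfds

    train_ds, tfds_info = tfds.load('glue/sst2', split='train',
                                    batch_size=batch_size, with_info=True)
    train_data_size = tfds_info.splits['train'].num_examples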
    

    If you're using a custom dataset, store the size of the cleaned dataset in a variable and use that variable wherever the training-data size is needed. Avoid hard-coding values wherever possible.
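
    A minimal sketch of that idea, assuming X_train holds the cleaned training examples:

    training_data_size = len(X_train)              # derived from the data, not hard-coded
    steps_per_epoch = training_data_size // batch_size
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(0.1 * num_train_steps)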