Fine-tuned BERT model for sentiment analysis overfits heavily

I am trying to fine-tune a pre-trained BERT model on the yelp_polarity_reviews data from tensorflow_datasets. I have made sure to:

  1. Load the pre-trained BERT model as a KerasLayer with tensorflow_hub.
  2. Use the same tokenizer, vocab_file and do_lower_case that were used to train the original model.
  3. Convert the dataset to a tf.data.Dataset object and apply the map function, wrapping my Python function in tf.py_function.
  4. Supply the input in the form BERT expects, i.e. input_word_ids, input_mask and input_type_ids in an array.
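
For reference, the three inputs in item 4 can be sketched in plain Python as below. This is only an illustration, not the notebook's code: the helper name `build_bert_inputs` is hypothetical, the token ids are made up, and the special ids (101 for [CLS], 102 for [SEP], 0 for padding) assume the standard uncased BERT vocabulary.

```python
# Minimal sketch: build the three BERT inputs for a single sentence.
# Assumptions: ids 101/102/0 for [CLS]/[SEP]/[PAD] (standard uncased vocab);
# the token ids passed in below are invented for illustration.

def build_bert_inputs(token_ids, max_seq_len, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_word_ids, input_mask, input_type_ids) as lists."""
    # Reserve two positions for [CLS] and [SEP], truncating if needed.
    token_ids = token_ids[: max_seq_len - 2]
    word_ids = [cls_id] + token_ids + [sep_id]
    n_real = len(word_ids)
    n_pad = max_seq_len - n_real

    input_word_ids = word_ids + [pad_id] * n_pad   # token ids, padded
    input_mask = [1] * n_real + [0] * n_pad        # 1 = real token, 0 = padding
    input_type_ids = [0] * max_seq_len             # single sentence: all segment 0
    return input_word_ids, input_mask, input_type_ids


ids, mask, types = build_bert_inputs([2023, 2833, 2001, 2307], max_seq_len=8)
print(ids)    # [101, 2023, 2833, 2001, 2307, 102, 0, 0]
print(mask)   # [1, 1, 1, 1, 1, 1, 0, 0]
print(types)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists have the same length (max_seq_len), which is what lets them be stacked into one array per example.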

With all of the above in place, the model still overfits badly during training: the training accuracy climbs to ~99% while the validation accuracy barely crosses the 50% mark.

I have tried different optimizers, loss functions and learning rates, both high and low dropout rates, and different training set sizes, but none of this improved the result.

Here is the colab notebook that shows the executed code.

Any suggestions and help would be highly appreciated.


  • I checked your Colab code, and after a few trials it looked like the problem was in the validation set, which turned out to be right. The mistake was loading the train labels into the test data set, so the test texts were paired with labels from a different split.

    yelp_test, _ = train_test_split(list(zip(yelp['test']['text'].numpy(),
                                             yelp['test']['label'].numpy())), # < correction

    Now, if you run the model, you will get

    history = ...
    915ms/step - loss: 0.3309 - binary_accuracy: 0.8473 - val_loss: 0.1722 - val_binary_accuracy: 0.9354
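
A quick way to see why this bug produces roughly 50% validation accuracy on a balanced binary task is the simulation below. It is a minimal sketch with synthetic labels, not the Yelp data: the "model" is a stand-in that always predicts the true test label, and all names (`accuracy`, `train_labels`, `test_labels`) are hypothetical.

```python
import random

def accuracy(predictions, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Synthetic, roughly balanced binary labels for two independent splits.
train_labels = [rng.randint(0, 1) for _ in range(n)]
test_labels = [rng.randint(0, 1) for _ in range(n)]

# Stand-in for a well-trained model: it predicts every test example correctly.
test_predictions = list(test_labels)

# Scored against the labels that actually belong to the test texts.
acc_correct = accuracy(test_predictions, test_labels)

# Scored against labels taken from the wrong split (the bug described above).
acc_mismatched = accuracy(test_predictions, train_labels)

print(f"aligned labels:    {acc_correct:.3f}")
print(f"mismatched labels: {acc_mismatched:.3f}")
```

Even a perfect model scores near chance when its predictions are compared with unrelated labels. A validation accuracy stuck around 50% alongside ~99% training accuracy is therefore worth treating as a possible data-pairing problem before tuning regularization.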