python tensorflow keras tensorflow2.0 tensorflow-datasets

Iterating over Tensorflow BatchDataset fails with InvalidArgumentError

I am following the Tensorflow NER example: https://keras.io/examples/nlp/ner_transformers/

However, calling fit() on the model fails with the following error:

InvalidArgumentError:  StringToNumberOp could not correctly convert string: 
     [[{{node StringToNumber_1}}]]
     [[IteratorGetNext]] [Op:__inference_train_function_2480]

I have isolated this to the BatchDataset iterator, which fails partway through iteration. Why does it fail with the above error when, according to the tutorial, it should work? I am using Tensorflow 2.7.0 and Keras 2.7.0.

The following colab can be used to replicate the error: https://colab.research.google.com/drive/1P1apD3o9I8bclzMN0S0CBUGdkEouzpr2?usp=sharing

Solution

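The last two lines of conll_train.txt and conll_val.txt cause the problem because they are not valid records. The error message shows an empty string after "could not correctly convert string:", which suggests the string-to-number conversion in the parsing step is being handed an empty field from those lines. If you skip them, everything works. Try this in the snippet where you create your datasets: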

import tensorflow as tf

# Count the lines in each file, then keep everything except the last two.
train_data = tf.data.TextLineDataset("./data/conll_train.txt")
train_data = train_data.take(len(list(train_data)) - 2)
val_data = tf.data.TextLineDataset("./data/conll_val.txt")
val_data = val_data.take(len(list(val_data)) - 2)
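
The fix above hard-codes "the last two lines". If you would rather confirm which lines are malformed in your own copy of the files, a small pure-Python check can help. This is only a sketch: the helper names is_valid_record and find_malformed_lines are made up for illustration, and it assumes each record has the form length, then that many tokens, then that many tags, all tab-separated, which is how the tutorial appears to write the files.

```python
def is_valid_record(line: str) -> bool:
    """Check one line against the ASSUMED format: length, tokens, tags (tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    try:
        # The first field must be an integer; an empty line fails here.
        length = int(fields[0])
    except ValueError:
        return False
    # Assumption: one length field, then `length` tokens and `length` tags.
    return len(fields) == 1 + 2 * length


def find_malformed_lines(lines):
    """Return the 1-based numbers of lines that do not match the assumed format."""
    return [n for n, line in enumerate(lines, start=1) if not is_valid_record(line)]


# Made-up sample data: line 2 is empty, line 3 has too few fields.
sample = ["2\tHello\tworld\t1\t1\n", "\n", "3\tonly\ttwo\t1\t1\n"]
print(find_malformed_lines(sample))  # → [2, 3]
```

To check a real file, pass the open file object, for example find_malformed_lines(open("./data/conll_train.txt")). The field-count test is stricter than the first-field conversion that raised the error, so it may flag lines that TensorFlow would have parsed without complaint.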