In Keras, we have keras.preprocessing.text to tokenize text according to our requirements and generate a vocabulary.
import tensorflow as tf

# oov_token should be a string token, not an integer
tokenizer = tf.keras.preprocessing.text.Tokenizer(split=' ', oov_token="<OOV>")
tokenizer.fit_on_texts(["Hello world"])
seqs = tokenizer.texts_to_sequences(["Hello world"])
What I am not sure about is whether I should explicitly add End of Sequence (EOS) and Beginning of Sequence (BOS) tags when feeding the generated seqs to a neural network such as an RNN, after padding each sequence to a fixed length. Or does Keras do it for us? (I have not seen any example that adds EOS and BOS explicitly when using the Keras tokenizer.)
No, it is not required to add <EOS> and <BOS> tags for tf.keras.preprocessing.text.Tokenizer. The index_word mapping starts with the oov_token, then assigns indices to words in descending order of frequency, with ties broken by the order in which the words appear in the input. Keras handles this mapping internally, unlike other text-preprocessing APIs that rely on explicit <START> and <END> tags.
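That said, if your model does need explicit boundary tokens (for example, a sequence-to-sequence decoder), a common workaround is to add them to the text yourself before fitting the tokenizer. Below is a minimal sketch of that pattern; the tag strings <bos> and <eos> are arbitrary choices of mine, not Keras built-ins, and filters='' is needed so the default filters do not strip the angle brackets:

import tensorflow as tf

sentences = ["hello world"]
# Wrap each sentence with hand-picked boundary tags before fitting.
tagged = ["<bos> " + s + " <eos>" for s in sentences]

# filters='' keeps '<' and '>' from being removed by the default filters.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<UNK>", filters='')
tokenizer.fit_on_texts(tagged)
print(tokenizer.texts_to_sequences(tagged))  # the tags get their own indices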
Below is an example with sample sentences to show the index_word mapping.
import tensorflow as tf

text_data = ["this is the sample sentence",
             "one more sentence"]

lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<UNK>")
lang_tokenizer.fit_on_texts(text_data)
lang_tokenizer.index_word
index_word:
{1: '<UNK>',
2: 'sentence',
3: 'this',
4: 'is',
5: 'the',
6: 'sample',
7: 'one',
8: 'more'}
Testing:
res = lang_tokenizer.texts_to_sequences(["testing with sample sentence"])
print(res)

Output ("testing" and "with" are unseen words, so both map to the <UNK> index 1):

[[1, 1, 6, 2]]
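And since you mentioned padding to a fixed length before feeding an RNN, pad_sequences can be applied directly to the output above; maxlen=5 and padding='post' here are just illustrative choices:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Zero-pad each sequence at the end up to a fixed length of 5.
padded = pad_sequences(res, maxlen=5, padding='post')
print(padded)  # [[1 1 6 2 0]]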
Hope this answers your question. Happy Learning!