In Keras, we have keras.preprocessing.text to tokenize text according to our requirements and generate a vocabulary.
import tensorflow as tf

# oov_token should be a string token, not an integer
tokenizer = tf.keras.preprocessing.text.Tokenizer(split=' ', oov_token="<OOV>")
tokenizer.fit_on_texts(["Hello world"])
seqs = tokenizer.texts_to_sequences(["Hello world"])
What I am not sure about is whether I should explicitly add End of Sequence (EOS) and Beginning of Sequence (BOS) tags when feeding the generated seqs to a neural network such as an RNN, after padding each sequence to a fixed length. Or does Keras do it for us? (I have not seen any example that adds EOS and BOS explicitly when using the Keras tokenizer.)
No, it is not required to add <EOS> and <BOS> tags for tf.keras.preprocessing.text.Tokenizer. The index_word mapping starts with the oov_token, then assigns indices to words in descending order of frequency, with ties broken by the order in which the words appear in the input. Keras handles this mapping internally, unlike other text-preprocessing APIs that rely on explicit <START> and <END> tags.
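That said, if your model does need explicit boundary tokens (for example, a sequence-to-sequence decoder), a common workaround is to add them to the text yourself before fitting the tokenizer. Below is a minimal sketch of that pattern; the tag strings <bos> and <eos> are arbitrary choices of mine, not Keras built-ins, and filters='' is needed so the default filters do not strip the angle brackets:

import tensorflow as tf

sentences = ["hello world"]
# Wrap each sentence with hand-picked boundary tags before fitting.
tagged = ["<bos> " + s + " <eos>" for s in sentences]

# filters='' keeps '<' and '>' from being removed by the default filters.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<UNK>", filters='')
tokenizer.fit_on_texts(tagged)
print(tokenizer.texts_to_sequences(tagged))  # the tags get their own indices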
Below is an example with sample sentences to show the index_word mapping.
import tensorflow as tf

text_data = ["this is the sample sentence",
             "one more sentence"]

lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<UNK>")
lang_tokenizer.fit_on_texts(text_data)
lang_tokenizer.index_word
index_word:
{1: '<UNK>',
2: 'sentence',
3: 'this',
4: 'is',
5: 'the',
6: 'sample',
7: 'one',
8: 'more'}
Testing:
res = lang_tokenizer.texts_to_sequences(["testing with sample sentence"])
print(res)

Output ("testing" and "with" are unseen words, so both map to the <UNK> index 1):

[[1, 1, 6, 2]]
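And since you mentioned padding to a fixed length before feeding an RNN, pad_sequences can be applied directly to the output above; maxlen=5 and padding='post' here are just illustrative choices:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Zero-pad each sequence at the end up to a fixed length of 5.
padded = pad_sequences(res, maxlen=5, padding='post')
print(padded)  # [[1 1 6 2 0]]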
Hope this answers your question. Happy Learning!