tensorflow keras tokenize text-processing

Keras tokenizer: Keep Numbers as "words"

I am using the keras tokenizer for my text preparation. Now I have x values like 26.07.2020 or 27.September 1993.

I want the tokenizer to add September to the index as a word, and likewise the numbers 26 and 2020.

I used char_level=True before, but I think the model should perform better if words like September are kept as single word tokens. Is this possible with the Keras tokenizer, and if so, how?

Thanks a lot.

Solution

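You can replace each . with a space. With char_level=False (the default), the Tokenizer splits the text on whitespace and treats each resulting word as a token, so 26, 07, 2020 and September each get their own entry in the word index.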

So a simple solution would be

x = x.replace('.', ' ')  # str.replace returns a new string, so assign the result
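A minimal sketch of that preprocessing step applied to the two example values from the question. The helper names split_dates and build_word_index are hypothetical, and the Tokenizer import assumes a TensorFlow version that still ships tensorflow.keras.preprocessing.text. It sits inside a function so the rest of the snippet runs without TensorFlow installed.

```python
def split_dates(text: str) -> str:
    """Replace dots with spaces so each date part becomes its own word."""
    return text.replace('.', ' ')


samples = ["26.07.2020", "27.September 1993"]

# Apply the replacement to every sample before fitting the tokenizer.
cleaned = [split_dates(s) for s in samples]

# Rough illustration of a word-level split: lowercase, then split on whitespace.
# This only mimics the idea; it is not Keras's own implementation.
words = [s.lower().split() for s in cleaned]


def build_word_index(texts):
    """Hypothetical helper: fit a word-level Keras Tokenizer on the cleaned texts.

    Assumes TensorFlow is installed and still provides this module.
    """
    from tensorflow.keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(char_level=False)  # word-level tokens, not characters
    tokenizer.fit_on_texts(texts)
    return tokenizer.word_index


print(cleaned)
print(words)
```

Calling build_word_index(cleaned) should then give a word index that contains entries such as september, 26 and 2020. Note that the Tokenizer lowercases text by default (lower=True), and as far as I know its default filters argument already includes the . character, so in word-level mode the explicit replace may be redundant. Doing it yourself just makes the behaviour explicit.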