Tags: tensorflow, keras, tokenize, text-processing

Keras tokenizer: Keep Numbers as "words"


I am using the Keras Tokenizer for my text preparation. My inputs contain values like 26.07.2020 or 27.September 1993.

I want the tokenizer to add September to the word index as a word, and likewise 26 and 2020.

I used char_level=True before, but I think the model would perform better with words like September as single word tokens. Is this possible with the Keras Tokenizer, and if so, how?

Thanks a lot.


Solution

  • You can replace each . with a space. The Tokenizer splits the text on whitespace and then indexes each resulting word.

    So a simple solution would be

    x = x.replace('.', ' ')  # str.replace returns a new string, so assign the result
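
To make the effect of the replacement concrete, here is a minimal sketch in plain Python. The helper name `split_on_dots` and the sample list `texts` are made up for illustration, and the `str.split()` call is only a stand-in for the whitespace splitting that a word-level tokenizer performs; it is not the Keras implementation.

```python
def split_on_dots(text: str) -> str:
    """Replace every '.' with a space so date parts become separate words."""
    return text.replace(".", " ")


# Hypothetical sample inputs taken from the question
texts = ["26.07.2020", "27.September 1993"]

# Apply the replacement to every input string
cleaned = [split_on_dots(t) for t in texts]

# Stand-in for word-level tokenization: split each string on whitespace
tokens = [t.split() for t in cleaned]

print(cleaned)  # ['26 07 2020', '27 September 1993']
print(tokens)   # [['26', '07', '2020'], ['27', 'September', '1993']]
```

The cleaned strings can then be passed to the Tokenizer's `fit_on_texts` with `char_level=False`. Two things to keep in mind: the Tokenizer lowercases text by default (`lower=True`), so September would be indexed as `september`, and as far as I know its default `filters` string already contains `.`, so the explicit replacement matters mostly when `filters` has been customized.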