
Include punctuation in keras tokenizer


Is there any way to include punctuation in the Keras Tokenizer?
I would like to have a transformation...

FROM

Tomorrow will be cold.

TO

Index-tomorrow, Index-will,...,Index-point

How can I achieve that?


Solution

  • This is possible if you do some pre-processing on the text.

    First, make sure that the punctuation is not filtered out by the Tokenizer. According to the documentation, the Tokenizer takes a filters argument on initialization: a string in which every character is stripped from the text. You can replace the default value with the set of characters you want to filter, leaving out the ones you want to keep in your index.

    Second, make sure that the punctuation is recognized as its own token. The Tokenizer splits on spaces, so if you tokenize the example sentence as-is, the result contains "cold." as a single token instead of "cold" and ".". What you need is a separator between the word and the punctuation. A naive approach is to replace each punctuation mark in the text with a space followed by that mark.

    The following code does what you ask:

    from keras.preprocessing.text import Tokenizer
    
    t = Tokenizer(filters='!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n') # default filters with '.' removed
    text = "Tomorrow will be cold."
    text = text.replace(".", " .")
    t.fit_on_texts([text])
    print(t.word_index)
    

    -> prints: {'will': 2, 'be': 3, 'cold': 4, 'tomorrow': 1, '.': 5}

    The replace logic can be done in a smarter way (e.g. with a regex if you want to capture all punctuation), but you get the gist.
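
    The regex variant mentioned above could look roughly like the sketch below. It uses only the standard library. The names KEEP, separate_punctuation and filters_without are made up for this illustration, and DEFAULT_FILTERS is assumed to match the Tokenizer's documented default filters string, so check it against your Keras version.

```python
import re

# Assumed copy of the Tokenizer's documented default `filters` value.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Punctuation marks to keep as their own tokens (illustrative choice).
KEEP = ".,!?;:"


def separate_punctuation(text, keep=KEEP):
    """Surround each kept punctuation mark with spaces so it becomes its own token."""
    pattern = "([" + re.escape(keep) + "])"
    spaced = re.sub(pattern, r" \1 ", text)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(spaced.split())


def filters_without(keep=KEEP, base=DEFAULT_FILTERS):
    """Return the filter string with the kept punctuation marks removed from it."""
    return "".join(c for c in base if c not in keep)


print(separate_punctuation("Tomorrow will be cold, right?"))
# prints: Tomorrow will be cold , right ?
```

    To use it, pass filters_without() as the filters argument when creating the Tokenizer and call fit_on_texts on the output of separate_punctuation instead of the raw text. Both functions must be given the same set of kept characters, otherwise a mark is either split off and then filtered out, or kept but left glued to its word.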