
How to tokenize punctuations using the Tokenizer function tensorflow


I use the Tokenizer class from tensorflow.keras.preprocessing.text like this:

from tensorflow.keras.preprocessing.text import Tokenizer
s = ["The quick brown fox jumped over the lazy dog."]
t = Tokenizer()
t.fit_on_texts(s)
print(t.word_index)

Output:

{'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5, 'over': 6, 'lazy': 7, 'dog': 8}

The Tokenizer strips out punctuation. How can I tokenize the punctuation as well (the `.` in this example)?


Solution

  • One possibility is to separate the punctuation from the words with spaces. I do this with a preprocessing function pad_punctuation; after that, I apply Tokenizer with filters='' so that no characters are filtered out.

    import re
    import string
    from tensorflow.keras.preprocessing.text import Tokenizer
    
    def pad_punctuation(s):
        # Surround every punctuation character with spaces so each one
        # survives tokenization as its own token. re.escape keeps
        # characters like ] \ ^ - from acting as regex syntax.
        return re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", s)
    
    S = ["The quick brown fox jumped over the lazy dog."]
    S = [pad_punctuation(s) for s in S]
    
    t = Tokenizer(filters='')
    t.fit_on_texts(S)
    print(t.word_index)
    

    Result:

    {'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5, 'over': 6, 'lazy': 7, 'dog': 8, '.': 9}
    

    The pad_punctuation function handles every punctuation character in string.punctuation.
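
    To see what pad_punctuation produces on its own, here is a small standalone check using only the standard library (no TensorFlow needed); the test sentence is just an illustrative example:

    ```python
    import re
    import string

    def pad_punctuation(s):
        # Surround every punctuation character with spaces; re.escape
        # prevents ] \ ^ - from being read as regex syntax inside the class.
        return re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", s)

    padded = pad_punctuation("Hello, world! Isn't this nice?")
    # Splitting on whitespace now yields punctuation as separate tokens:
    print(padded.split())
    # → ['Hello', ',', 'world', '!', 'Isn', "'", 't', 'this', 'nice', '?']
    ```

    This is exactly the split that Tokenizer(filters='') performs internally, which is why each punctuation mark ends up with its own entry in word_index.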