python tensorflow keras text tokenize

How to tokenize text using TensorFlow?


I am trying to use the following code to vectorize a sentence:

from tensorflow.keras.layers import TextVectorization

text_vectorization_layer =  TextVectorization(max_tokens=10000,
                                              ngrams=5,
                                              standardize='lower_and_strip_punctuation',
                                              output_mode='int',
                                              output_sequence_length = 15
                                              )

text_vectorization_layer(['BlackBerry Limited is a Canadian software'])

However, it complains with the following error:

AttributeError: 'NoneType' object has no attribute 'ndims'


Solution

  • You first have to build the vocabulary of the TextVectorization layer, either by calling its adapt method on your data or by passing a vocabulary array to the layer's vocabulary argument. Calling the layer before it has a vocabulary raises the error above. Here is a working example:

    import tensorflow as tf

    text_vectorization_layer = tf.keras.layers.TextVectorization(
        max_tokens=10000,
        ngrams=5,
        standardize='lower_and_strip_punctuation',
        output_mode='int',
        output_sequence_length=15,
    )

    # Build the vocabulary from the data before calling the layer
    text_vectorization_layer.adapt(['BlackBerry Limited is a Canadian software'])
    print(text_vectorization_layer(['BlackBerry Limited is a Canadian software']))

    Output:

    tf.Tensor([[18  7 11 21 13  2 17  6 10 20 12 16  5  9 19]], shape=(1, 15), dtype=int64)
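
To see roughly where those 15 ids come from, here is a pure-Python sketch that mimics the layer's pipeline for this one sentence. It is not the Keras implementation: the punctuation regex is a simplification, the helper name `tokenize_with_ngrams` is made up for illustration, and the reverse-alphabetical ordering of equally frequent tokens is an assumption inferred from the output above rather than documented behavior. The point it illustrates is that `ngrams=5` produces every n-gram of size 1 through 5, so six words become 20 tokens, and `output_sequence_length=15` keeps only the first 15.

```python
import re


def tokenize_with_ngrams(text, max_n):
    """Hypothetical helper: lowercase, strip punctuation, split, build n-grams."""
    # Approximation of 'lower_and_strip_punctuation' plus whitespace splitting
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    tokens = []
    # All unigrams first, then all bigrams, and so on up to max_n
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


text = "BlackBerry Limited is a Canadian software"
tokens = tokenize_with_ngrams(text, max_n=5)

# Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
# Every token occurs once here; sorting ties in reverse alphabetical order
# is an assumption that happens to reproduce the ids shown above.
vocab = ["", "[UNK]"] + sorted(set(tokens), reverse=True)
index = {token: i for i, token in enumerate(vocab)}

seq_len = 15
ids = [index[token] for token in tokens][:seq_len]  # truncate to seq_len
ids += [0] * (seq_len - len(ids))                   # pad if shorter

print(len(tokens))  # 20
print(ids)          # [18, 7, 11, 21, 13, 2, 17, 6, 10, 20, 12, 16, 5, 9, 19]
```

The first six ids are the single words, the next five are the bigrams, and the last four are the first four trigrams; the remaining trigram and all 4-grams and 5-grams are cut off by the sequence length.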
    

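    The layer standardizes and splits the strings internally (on whitespace by default), so you can pass raw strings directly. See the TextVectorization documentation for the remaining options.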
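
The other route mentioned above, passing the vocabulary up front instead of calling adapt, might look like the sketch below. The word list, the variable names, and the shorter sequence length are chosen only for illustration, and `ngrams` is left out so the mapping stays easy to read.

```python
import tensorflow as tf

# Illustrative vocabulary; in practice it would come from your own corpus.
# Do not include the padding or out-of-vocabulary tokens yourself.
vocab = ["blackberry", "limited", "is", "a", "canadian", "software"]

layer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    output_mode="int",
    output_sequence_length=8,
    vocabulary=vocab,  # no adapt() call needed when this is supplied
)

ids = layer(tf.constant(["BlackBerry Limited is a Canadian software"]))

# In 'int' mode index 0 is padding and index 1 is '[UNK]',
# so the supplied words start at index 2.
print(layer.get_vocabulary()[:3])
print(ids)
```

With this vocabulary the six words should map to ids 2 through 7, followed by two zeros of padding. Supplying the vocabulary is handy when it was computed elsewhere, for example saved from a training run, while adapt is simpler when the raw text is at hand.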