How do I tokenize text using TensorFlow?

I am trying to use the following code to vectorize a sentence:

from tensorflow.keras.layers import TextVectorization

text_vectorization_layer = TextVectorization(
    max_tokens=10000,
    ngrams=5,
    standardize='lower_and_strip_punctuation',
    output_mode='int',
    output_sequence_length=15,
)

text_vectorization_layer(['BlackBerry Limited is a Canadian software'])

However, it fails with the following error:

AttributeError: 'NoneType' object has no attribute 'ndims'

Solution

The TextVectorization layer needs a vocabulary before it can map strings to integers. You can build one either by calling the layer's adapt method on sample text or by passing a list of tokens to the layer's vocabulary argument. Here is a working example:

import tensorflow as tf

text_vectorization_layer = tf.keras.layers.TextVectorization(
    max_tokens=10000,
    ngrams=5,
    standardize='lower_and_strip_punctuation',
    output_mode='int',
    output_sequence_length=15,
)

# Build the vocabulary from sample data before calling the layer.
text_vectorization_layer.adapt(['BlackBerry Limited is a Canadian software'])
print(text_vectorization_layer(['BlackBerry Limited is a Canadian software']))

tf.Tensor([[18  7 11 21 13  2 17  6 10 20 12 16  5  9 19]], shape=(1, 15), dtype=int64)
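
If you already know the tokens you care about, you can skip adapt and supply the vocabulary directly. The following is only a minimal sketch: the vocab list and the second input sentence are made up for illustration, and ngrams is left at its default so that the tokens are single words.

```python
import tensorflow as tf

# Hypothetical hand-written vocabulary (an assumption for this example).
# In 'int' mode the layer reserves index 0 for padding and index 1 for
# out-of-vocabulary tokens, so these words get indices 2 through 7.
vocab = ["blackberry", "limited", "is", "a", "canadian", "software"]

layer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    output_mode="int",
    output_sequence_length=8,
    vocabulary=vocab,  # no adapt() call needed
)

result = layer([
    "BlackBerry Limited is a Canadian software",
    "BlackBerry makes phones",  # "makes" and "phones" are not in vocab
])
print(result.numpy().tolist())
# [[2, 3, 4, 5, 6, 7, 0, 0], [2, 1, 1, 0, 0, 0, 0, 0]]
```

Unknown words map to index 1 and short sequences are padded with 0 up to output_sequence_length.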

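The layer tokenizes the strings internally: it lowercases the text and strips punctuation, splits on whitespace, builds n-grams of up to 5 words (because ngrams=5), and maps each resulting token to its integer index in the vocabulary. See the TextVectorization documentation for the full list of options.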
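
To see which tokens the layer actually produced, you can inspect the vocabulary after calling adapt. This is a small sketch reusing the settings from the example above (the variable names are arbitrary). With ngrams=5 the single six-word sentence should yield 20 n-grams (6 + 5 + 4 + 3 + 2), plus the two reserved entries for padding and unknown tokens.

```python
import tensorflow as tf

layer = tf.keras.layers.TextVectorization(
    max_tokens=10000,
    ngrams=5,
    standardize="lower_and_strip_punctuation",
    output_mode="int",
)
layer.adapt(["BlackBerry Limited is a Canadian software"])

# get_vocabulary() returns the tokens in index order.
vocab = layer.get_vocabulary()
print(len(vocab))   # 2 reserved entries + 20 n-grams
print(vocab[:2])    # the reserved padding and out-of-vocabulary tokens

# Print each token next to the integer the layer will emit for it.
for index, token in enumerate(vocab):
    print(index, repr(token))
```

The indices printed here are the same integers that appear in the layer's output tensor.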