Tags: python, tensorflow, machine-learning, keras, tokenize

What is Keras tokenizer.fit_on_texts doing?


How do you use the Keras Tokenizer method fit_on_texts?

How does it differ from fit_on_sequences?


Solution

  • fit_on_texts, used in conjunction with texts_to_matrix, produces the one-hot encoding for a text; see https://www.tensorflow.org/text/guide/word_embeddings

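    A minimal sketch of that pairing (binary mode; column 0 is the reserved padding index and never maps to a word):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ['check check fail']
tok = Tokenizer()
tok.fit_on_texts(texts)                           # builds {'check': 1, 'fail': 2}
matrix = tok.texts_to_matrix(texts, mode='binary')
# one row per text, one column per word index; column 0 is reserved
# array([[0., 1., 1.]])
```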

    fit_on_texts

    An example for using fit_on_texts

    from tensorflow.keras.preprocessing.text import Tokenizer
    text = 'check check fail'
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text])
    tokenizer.word_index
    

    will produce {'check': 1, 'fail': 2}

    Note that we pass [text] rather than text, since the input must be a list (or other iterable) whose elements are each treated as a separate text. The input can also be a text generator or a list of lists of strings.
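    For instance, fitting on a two-element list treats each element as its own text, and word counts aggregate across all of them (a sketch):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer()
tok.fit_on_texts(['check check fail', 'check pass'])  # two texts, not two tokens
print(tok.word_counts)  # OrderedDict([('check', 3), ('fail', 1), ('pass', 1)])
print(tok.word_index)   # {'check': 1, 'fail': 2, 'pass': 3}
```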

    Passing a text generator as input is memory efficient. Here is an example: (1) define a generator that flattens an iterable of batches of texts into a stream of individual texts

    def text_generator(texts_generator):
        # flatten an iterable of lists of texts into one text at a time
        for texts in texts_generator:
            for text in texts:
                yield text
    

    (2) pass it as input to fit_on_texts

    tokenizer.fit_on_texts(text_generator(batches_of_texts))  # batches_of_texts: any iterable of lists of texts
    
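    A concrete, self-contained sketch, where a plain list stands in for a memory-efficient source such as lines read lazily from a large file:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

def text_gen():
    # stand-in for a lazy source, e.g. reading a large file line by line
    for line in ['check check fail', 'check pass']:
        yield line

tok = Tokenizer()
tok.fit_on_texts(text_gen())  # fit_on_texts accepts any iterable of strings
print(tok.word_index)         # {'check': 1, 'fail': 2, 'pass': 3}
```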

    fit_on_texts is used before calling texts_to_matrix, which produces the one-hot encoding for the original set of texts.

    num_words argument

    Passing the num_words argument to the Tokenizer limits the representation to the most frequent words (index 0 is reserved, hence the +1 in the examples below). First, with num_words = 1 + 1 we encode only the single most frequent word, love:

    sentences = [
        'i love my dog',
        'I, love my cat',
        'You love my dog!'
    ]
    
    tokenizer = Tokenizer(num_words = 1+1)
    tokenizer.fit_on_texts(sentences)
    tokenizer.texts_to_sequences(sentences) # [[1], [1], [1]]
    
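    The +1 is needed because index 0 is reserved and the cutoff keeps only indices strictly below num_words. If you also pass an oov_token (which itself takes index 1), filtered and unseen words map to it instead of being silently dropped. A sketch:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
tok = Tokenizer(num_words=3, oov_token='<OOV>')  # keeps '<OOV>' (1) and 'love' (2)
tok.fit_on_texts(sentences)
seqs = tok.texts_to_sequences(['i love candy'])
# 'i' falls outside num_words and 'candy' was never seen,
# so both become the OOV index: [[1, 2, 1]]
```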

    Second, with num_words = 100 + 1 we encode the 100 most frequent words (here, all six distinct words):

    tokenizer = Tokenizer(num_words = 100+1)
    tokenizer.fit_on_texts(sentences)
    tokenizer.texts_to_sequences(sentences) # [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]
    
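    The integers in those sequences come from word_index, which always contains the full vocabulary regardless of num_words (the cutoff is applied only when converting). Inspecting it makes the mapping explicit:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
tok = Tokenizer(num_words=100 + 1)
tok.fit_on_texts(sentences)
print(tok.word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```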

    fit_on_sequences

    fit_on_sequences works on "sequences", i.e. lists of integer word indices. It records document statistics over those sequences (needed, for example, by the 'tfidf' mode) and is used before calling sequences_to_matrix:

    from tensorflow.keras.preprocessing.text import Tokenizer
    test_seq = [[1,2,3,4,5,6]]
    tok = Tokenizer(num_words=10)
    tok.fit_on_sequences(test_seq)
    tok.sequences_to_matrix(test_seq)
    

    Producing

    array([[0., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])
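    sequences_to_matrix also supports other modes; for instance, 'count' fills each cell with the number of occurrences rather than a 0/1 flag. A sketch:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer(num_words=10)
seq = [[1, 2, 2, 3]]
tok.fit_on_sequences(seq)
matrix = tok.sequences_to_matrix(seq, mode='count')
# index 2 occurs twice, indices 1 and 3 once each:
# array([[0., 1., 2., 1., 0., 0., 0., 0., 0., 0.]])
```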