Tags: python, tensorflow, keras, tokenize, text-processing

What is Keras' Tokenizer fit_on_sequences used for?


I'm familiar with the fit_on_texts method of Keras' Tokenizer. What does fit_on_sequences do, and when is it useful? According to the documentation, it "Updates internal vocabulary based on a list of sequences.", and it takes as input: 'A list of sequence. A "sequence" is a list of integer word indices.'.

When fitting on texts, I understand that the text is parsed into tokens and each token is assigned an integer index. The tokenizer object therefore contains, among other things, a dictionary mapping tokens (strings) to indices (integers). However, if I give it only sequences of integers and call fit_on_sequences, how would it know which tokens those integers represent?

As an experiment, try the following:

    from tensorflow.keras.preprocessing.text import Tokenizer

    test_seq = [[1,2,3,4,5,6]]

    tok = Tokenizer()
    tok.fit_on_sequences(test_seq)

After this, the properties word_index and index_word, which would otherwise contain the token dictionaries, are of course empty. The documentation also says of fit_on_sequences: "Required before using sequences_to_matrix (if fit_on_texts was never called)." However, calling sequences_to_matrix after calling only fit_on_sequences (and never fit_on_texts) does not work. So, what is fit_on_sequences used for?


Solution

  • sequences_to_matrix does work after calling fit_on_sequences; you just need to pass the num_words argument when instantiating Tokenizer().

    from tensorflow.keras.preprocessing.text import Tokenizer
    
    test_seq = [[1,2,3,4,5,6]]
    
    tok = Tokenizer(num_words=10)
    tok.fit_on_sequences(test_seq)
    
    tok.sequences_to_matrix(test_seq)

    # Output:
    # array([[0., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])
    

    The zero at the beginning is there because there is no 0 in your sequence, and the zeros at the end are there because I set num_words to 10 but the highest value in your test sequence is 6.

    Its purpose is simply to skip the step of mapping strings to integers: it works directly on the integer indices and never needs the tokens themselves.
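To make the explanation above concrete, here is a minimal pure-Python sketch of the bookkeeping involved. MiniTokenizer is a hypothetical stand-in written for illustration, not the library's actual code, and it only covers the binary mode of sequences_to_matrix. It assumes that fit_on_sequences does no more than count the sequences seen and how many sequences each index appears in, and that the matrix width has to come from num_words because there is no word_index to infer it from.

```python
from collections import defaultdict


class MiniTokenizer:
    """Hypothetical, simplified stand-in for the sequence-only parts of Tokenizer."""

    def __init__(self, num_words=None):
        self.num_words = num_words
        self.document_count = 0
        # index -> number of sequences that contain that index
        self.index_docs = defaultdict(int)

    def fit_on_sequences(self, sequences):
        # No strings are involved: only document-level counts are updated.
        self.document_count += len(sequences)
        for seq in sequences:
            for idx in set(seq):
                self.index_docs[idx] += 1

    def sequences_to_matrix(self, sequences):
        # Binary mode only. Without num_words there is no way to pick a width.
        if not self.num_words:
            raise ValueError("Specify num_words, or fit on some text data first.")
        matrix = [[0.0] * self.num_words for _ in sequences]
        for row, seq in zip(matrix, sequences):
            for idx in seq:
                if idx < self.num_words:  # indices beyond the width are ignored
                    row[idx] = 1.0
        return matrix


test_seq = [[1, 2, 3, 4, 5, 6]]

tok = MiniTokenizer(num_words=10)
tok.fit_on_sequences(test_seq)

matrix = tok.sequences_to_matrix(test_seq)
print(matrix)
print(tok.document_count, dict(tok.index_docs))
```

Column 0 stays empty because no sequence contains index 0, and columns 7 to 9 stay empty because the largest index is 6. In the real Tokenizer, per-index document counts of this kind appear to matter mainly for the "tfidf" mode of sequences_to_matrix, which is probably the case the documentation's "Required before using sequences_to_matrix" note has in mind.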