I'm familiar with the method 'fit_on_texts' from Keras' Tokenizer. What does 'fit_on_sequences' do and when is it useful? According to the documentation, it "Updates internal vocabulary based on a list of sequences.", and it takes as input: 'A list of sequence. A "sequence" is a list of integer word indices.'.
For fitting on texts, I understand the text is parsed into tokens and each token is assigned an index (integer). Thus, the tokenizer object contains, among other things, a dictionary relating tokens (strings) and indices (integers). However, if I give it only a sequence of numbers and call fit_on_sequences, how would it know which tokens those numbers represent?
As an experiment, try the following:
from tensorflow.keras.preprocessing.text import Tokenizer
test_seq = [[1,2,3,4,5,6]]
tok = Tokenizer()
tok.fit_on_sequences(test_seq)
Then the properties word_index and index_word, which would otherwise contain the dictionary of values, are, of course, empty. The documentation also states about fit_on_sequences: "Required before using sequences_to_matrix (if fit_on_texts was never called)." However, calling sequences_to_matrix after calling only fit_on_sequences (not fit_on_texts) does not work. So, what is fit_on_sequences used for?
sequences_to_matrix does work after calling fit_on_sequences; you just need to specify the argument num_words in the Tokenizer() instantiation.
from tensorflow.keras.preprocessing.text import Tokenizer
test_seq = [[1,2,3,4,5,6]]
tok = Tokenizer(num_words=10)
tok.fit_on_sequences(test_seq)
tok.sequences_to_matrix(test_seq)
array([[0., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])
The zero at the beginning is there because there is no 0 in your sequence, and the zeroes at the end are there because I specified num_words=10 but the highest value in your test sequence is 6.
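To make the layout of that row concrete, here is a minimal pure-Python sketch of a binary "bag of indices" encoding. It is not the Keras implementation: binary_matrix is a hypothetical helper, written under the assumption that column j is set to 1 whenever index j occurs in the sequence and that indices at or above num_words are dropped.

```python
def binary_matrix(sequences, num_words):
    """Hypothetical stand-in for sequences_to_matrix in binary mode (an assumption, not Keras code)."""
    rows = []
    for seq in sequences:
        # One column per possible index, 0 .. num_words - 1.
        row = [0.0] * num_words
        for idx in seq:
            # Skip indices that do not fit into the matrix width.
            if 0 <= idx < num_words:
                row[idx] = 1.0
        rows.append(row)
    return rows


test_seq = [[1, 2, 3, 4, 5, 6]]
matrix = binary_matrix(test_seq, num_words=10)
print(matrix)
```

Column 0 stays empty because no sequence contains the index 0, and columns 7 to 9 stay empty because no index that large occurs.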
The purpose of fit_on_sequences is simply to skip the step of mapping strings to integers: it works directly on the integer indices, so no word_index is ever built.
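As a rough illustration of what "fitting" on integers alone can mean, the sketch below keeps only per-index statistics: how many sequences were seen, and in how many of them each index appears. This is an assumption about the kind of bookkeeping involved, not a copy of the Keras code, and fit_on_index_sequences is a hypothetical name.

```python
from collections import defaultdict


def fit_on_index_sequences(sequences):
    """Hypothetical sketch: gather statistics from integer sequences without any strings."""
    document_count = 0
    index_docs = defaultdict(int)  # index -> number of sequences containing it
    for seq in sequences:
        document_count += 1
        # set() ensures an index is counted once per sequence, even if repeated.
        for idx in set(seq):
            index_docs[idx] += 1
    return document_count, dict(index_docs)


docs = [[1, 2, 2, 3], [2, 3, 4]]
document_count, index_docs = fit_on_index_sequences(docs)
print(document_count)
print(index_docs)
```

Statistics of this kind are what a document-frequency-based weighting would need, and none of them require knowing which word an index stands for.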