Tags: python, tensorflow, machine-learning, keras, tokenize

What is Keras tokenizer.fit_on_texts doing?


How do you use the Keras Tokenizer method fit_on_texts?

How does it differ from fit_on_sequences?


Solution

  • fit_on_texts, used in conjunction with texts_to_matrix, produces the one-hot encoding for a text; see https://www.tensorflow.org/text/guide/word_embeddings

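    A minimal sketch of that pairing (binary mode; column 0 is the reserved padding index and never maps to a word):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ['check check fail']
tok = Tokenizer()
tok.fit_on_texts(texts)                           # builds {'check': 1, 'fail': 2}
matrix = tok.texts_to_matrix(texts, mode='binary')
# one row per text, one column per word index; column 0 is reserved
# array([[0., 1., 1.]])
```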

    fit_on_texts

    An example for using fit_on_texts

    from tensorflow.keras.preprocessing.text import Tokenizer
    text = 'check check fail'
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text])
    tokenizer.word_index
    

    will produce {'check': 1, 'fail': 2}

    Note that we pass [text] rather than text, since the input must be a list (or other iterable) whose elements are each treated as a separate text. The input can also be a text generator or a list of lists of strings.
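    For instance, fitting on a two-element list treats each element as its own text, and word counts aggregate across all of them (a sketch):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer()
tok.fit_on_texts(['check check fail', 'check pass'])  # two texts, not two tokens
print(tok.word_counts)  # OrderedDict([('check', 3), ('fail', 1), ('pass', 1)])
print(tok.word_index)   # {'check': 1, 'fail': 2, 'pass': 3}
```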

    Passing a text generator as input is memory efficient. Here is an example: (1) define a generator that flattens an iterable of batches of texts into a stream of individual texts

    def text_generator(texts_generator):
        # flatten an iterable of lists of texts into one text at a time
        for texts in texts_generator:
            for text in texts:
                yield text
    

    (2) pass it as input to fit_on_texts

    tokenizer.fit_on_texts(text_generator(batches_of_texts))  # batches_of_texts: any iterable of lists of texts
    
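    A concrete, self-contained sketch, where a plain list stands in for a memory-efficient source such as lines read lazily from a large file:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

def text_gen():
    # stand-in for a lazy source, e.g. reading a large file line by line
    for line in ['check check fail', 'check pass']:
        yield line

tok = Tokenizer()
tok.fit_on_texts(text_gen())  # fit_on_texts accepts any iterable of strings
print(tok.word_index)         # {'check': 1, 'fail': 2, 'pass': 3}
```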

    fit_on_texts is used before calling texts_to_matrix, which produces the one-hot encoding for the original set of texts.

    num_words argument

    Passing the num_words argument to the Tokenizer limits the representation to the most frequent words (index 0 is reserved, hence the +1 in the examples below). First, with num_words = 1 + 1 we encode only the single most frequent word, love:

    sentences = [
        'i love my dog',
        'I, love my cat',
        'You love my dog!'
    ]
    
    tokenizer = Tokenizer(num_words = 1+1)
    tokenizer.fit_on_texts(sentences)
    tokenizer.texts_to_sequences(sentences) # [[1], [1], [1]]
    
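    The +1 is needed because index 0 is reserved and the cutoff keeps only indices strictly below num_words. If you also pass an oov_token (which itself takes index 1), filtered and unseen words map to it instead of being silently dropped. A sketch:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
tok = Tokenizer(num_words=3, oov_token='<OOV>')  # keeps '<OOV>' (1) and 'love' (2)
tok.fit_on_texts(sentences)
seqs = tok.texts_to_sequences(['i love candy'])
# 'i' falls outside num_words and 'candy' was never seen,
# so both become the OOV index: [[1, 2, 1]]
```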

    Second, with num_words = 100 + 1 we encode the 100 most frequent words (here, all six distinct words):

    tokenizer = Tokenizer(num_words = 100+1)
    tokenizer.fit_on_texts(sentences)
    tokenizer.texts_to_sequences(sentences) # [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4]]
    
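    The integers in those sequences come from word_index, which always contains the full vocabulary regardless of num_words (the cutoff is applied only when converting). Inspecting it makes the mapping explicit:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['i love my dog', 'I, love my cat', 'You love my dog!']
tok = Tokenizer(num_words=100 + 1)
tok.fit_on_texts(sentences)
print(tok.word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```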

    fit_on_sequences

    fit_on_sequences works on "sequences", i.e. lists of integer word indices. It records document statistics over those sequences (needed, for example, by the 'tfidf' mode) and is used before calling sequences_to_matrix:

    from tensorflow.keras.preprocessing.text import Tokenizer
    test_seq = [[1,2,3,4,5,6]]
    tok = Tokenizer(num_words=10)
    tok.fit_on_sequences(test_seq)
    tok.sequences_to_matrix(test_seq)
    

    Producing

    array([[0., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])
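    sequences_to_matrix also supports other modes; for instance, 'count' fills each cell with the number of occurrences rather than a 0/1 flag. A sketch:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tok = Tokenizer(num_words=10)
seq = [[1, 2, 2, 3]]
tok.fit_on_sequences(seq)
matrix = tok.sequences_to_matrix(seq, mode='count')
# index 2 occurs twice, indices 1 and 3 once each:
# array([[0., 1., 2., 1., 0., 0., 0., 0., 0., 0.]])
```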