python, tensorflow, keras, nlp, tensorflow2.0

How does text encoding from tensorflow.keras.preprocessing.text.Tokenizer differ from the old tfds.deprecated.text.TokenTextEncoder?


tfds.deprecated.text.TokenTextEncoder

In the deprecated encoding method with tfds.deprecated.text.TokenTextEncoder, we first create a vocabulary set of tokens:

tokenizer = tfds.deprecated.text.Tokenizer()
vocabulary_set = set()

# imdb_train --> imdb dataset from tensorflow_datasets
for example, label in imdb_train:
    some_tokens = tokenizer.tokenize(example.numpy())
    vocabulary_set.update(some_tokens)  # collect every token into the vocabulary

Then we load it into the encoder:

encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set,
                                                lowercase=True,
                                                tokenizer=tokenizer)

Afterward, when performing encoding, I noticed the encoder outputs a single integer per word; for example, while debugging I found that the word "the" was encoded as 112:

token_id = encoder.encode(word)[0]
>> token_id = 112

But then when it comes to

tensorflow.keras.preprocessing.text.Tokenizer

tokenizer = tensorflow.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(words)
token_id = tokenizer.texts_to_sequences(word)  # word = "the"
>> token_id = [800, 2085, 936]

It produces a sequence of 3 integers. Do I use all 3 numbers, or would it also be correct to take just one number from the sequence? I'm trying to use the encoded integers to create an embedding matrix using GloVe embeddings. The old deprecated encoder produces just one integer per word, so it's easy to map; with an integer sequence I'm not sure how to proceed.


Solution

  • The three integers come from how you call texts_to_sequences: it expects a list of texts, so passing the bare string "the" makes Keras iterate over it character by character and return one ID per character ('t', 'h', 'e'), not one per word. Wrap the input in a list and you get a single ID per word. Maybe try something like this:

    import tensorflow as tf
    
    lines = ['You are a fish', 'This is a fish', 'Where are the fishes']
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(lines)
    text_sequences = tokenizer.texts_to_sequences(lines)
    text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')
    vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
    print(tokenizer.word_index)
    print(vocab_size)
    print(tokenizer.texts_to_sequences(['fish'])[0])  # note the list around 'fish'
    
    {'are': 1, 'a': 2, 'fish': 3, 'you': 4, 'this': 5, 'is': 6, 'where': 7, 'the': 8, 'fishes': 9}
    10
    [3]
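
    Passing a bare string instead of a list reproduces the behavior from the question: Keras iterates over the string character by character and returns one (possibly empty) sequence per character. A quick check with the tokenizer above, where the single characters happen not to be in this small vocabulary:

    print(tokenizer.texts_to_sequences('fish'))    # [[], [], [], []] -- four one-character "texts"
    print(tokenizer.texts_to_sequences(['fish']))  # [[3]] -- one sequence for the word 'fish'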
    

    The index 0 is reserved for the padding token, which is why vocab_size is len(word_index) + 1. And then to create the weight matrix with the GloVe model, try this:

    import gensim.downloader as api
    import numpy as np
    
    model = api.load("glove-twitter-25")  # pretrained 25-dim GloVe vectors
    embedding_dim = 25
    weight_matrix = np.zeros((vocab_size, embedding_dim))  # row 0 stays all-zero for padding
    for word, i in tokenizer.word_index.items():
        try:
            # copy the pretrained vector for words GloVe knows
            weight_matrix[i] = model[word]
        except KeyError:
            # out-of-vocabulary word: fall back to a random vector
            weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)
    print(weight_matrix.shape)
    # (10, 25)
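
    To actually use weight_matrix in a model, you can hand it to a Keras Embedding layer as its initial weights. This is a sketch, not part of the original answer; freezing the layer and masking the padding index are assumptions about the intended setup:

    # Hypothetical usage: build a frozen Embedding layer from the GloVe matrix
    embedding_layer = tf.keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(weight_matrix),
        trainable=False,  # keep the pretrained GloVe vectors fixed
        mask_zero=True)   # index 0 is the padding token, so mask it
    
    embedded = embedding_layer(text_sequences)
    print(embedded.shape)  # (3, 4, 25): 3 sentences, 4 tokens each, 25 dims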