python, tensorflow, keras, nlp, tensorflow2.0

How does text encoding from tensorflow.keras.preprocessing.text.Tokenizer differ from the old tfds.deprecated.text.TokenTextEncoder?


tfds.deprecated.text.TokenTextEncoder

In the deprecated encoding method with tfds.deprecated.text.TokenTextEncoder, we first create a vocabulary set of tokens:

tokenizer = tfds.deprecated.text.Tokenizer()
vocabulary_set = set()

# imdb_train --> imdb dataset from tensorflow_datasets
for example, label in imdb_train:
    some_tokens = tokenizer.tokenize(example.numpy())
    vocabulary_set.update(some_tokens)  # collect every token into the vocabulary

Then we load it into the encoder:

encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set,
                                                lowercase=True,
                                                tokenizer=tokenizer)

Afterward, when performing encoding, I noticed the encoder outputs a single integer per word; for example, while debugging I found that the word "the" was encoded as 112:

token_id = encoder.encode(word)[0]
>> token_id = 112

But then when it comes to

tensorflow.keras.preprocessing.text.Tokenizer

tokenizer = tensorflow.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(words)
token_id = tokenizer.texts_to_sequences(word)  # word = "the"
>> token_id = [800, 2085, 936]

It produces a sequence of 3 integers. Do I use all 3 numbers, or would it also be correct to take just one number from the sequence? I'm trying to use the encoded integers to create an embedding matrix using GloVe embeddings. The old deprecated encoder produces just one integer per word, so it's easy to map; with an integer sequence I'm not sure how to proceed.


Solution

  • The three integers come from how you call texts_to_sequences: it expects a list of texts, so passing the bare string "the" makes Keras iterate over it character by character and return one ID per character ('t', 'h', 'e'), not one per word. Wrap the input in a list and you get a single ID per word. Maybe try something like this:

    import tensorflow as tf
    
    lines = ['You are a fish', 'This is a fish', 'Where are the fishes']
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(lines)
    text_sequences = tokenizer.texts_to_sequences(lines)
    text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')
    vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
    print(tokenizer.word_index)
    print(vocab_size)
    print(tokenizer.texts_to_sequences(['fish'])[0])  # note the list around 'fish'
    
    {'are': 1, 'a': 2, 'fish': 3, 'you': 4, 'this': 5, 'is': 6, 'where': 7, 'the': 8, 'fishes': 9}
    10
    [3]
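
    Passing a bare string instead of a list reproduces the behavior from the question: Keras iterates over the string character by character and returns one (possibly empty) sequence per character. A quick check with the tokenizer above, where the single characters happen not to be in this small vocabulary:

    print(tokenizer.texts_to_sequences('fish'))    # [[], [], [], []] -- four one-character "texts"
    print(tokenizer.texts_to_sequences(['fish']))  # [[3]] -- one sequence for the word 'fish'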
    

    The index 0 is reserved for the padding token, which is why vocab_size is len(word_index) + 1. And then to create the weight matrix with the GloVe model, try this:

    import gensim.downloader as api
    import numpy as np
    
    model = api.load("glove-twitter-25")  # pretrained 25-dim GloVe vectors
    embedding_dim = 25
    weight_matrix = np.zeros((vocab_size, embedding_dim))  # row 0 stays all-zero for padding
    for word, i in tokenizer.word_index.items():
        try:
            # copy the pretrained vector for words GloVe knows
            weight_matrix[i] = model[word]
        except KeyError:
            # out-of-vocabulary word: fall back to a random vector
            weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)
    print(weight_matrix.shape)
    # (10, 25)
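
    To actually use weight_matrix in a model, you can hand it to a Keras Embedding layer as its initial weights. This is a sketch, not part of the original answer; freezing the layer and masking the padding index are assumptions about the intended setup:

    # Hypothetical usage: build a frozen Embedding layer from the GloVe matrix
    embedding_layer = tf.keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(weight_matrix),
        trainable=False,  # keep the pretrained GloVe vectors fixed
        mask_zero=True)   # index 0 is the padding token, so mask it
    
    embedded = embedding_layer(text_sequences)
    print(embedded.shape)  # (3, 4, 25): 3 sentences, 4 tokens each, 25 dims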