Tags: python, tensorflow, keras, nlp

How can I optimize my code to inverse transform the output of TextVectorization?


I'm using a TextVectorization layer in a TF Keras Sequential model, and I need to convert that layer's intermediate output back to plain text. As far as I can tell, there is no direct way to do this, so I used the layer's vocabulary to inverse transform the vectors. The code is as follows:

import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
text_list = np.array(["this is the first sentence.","second line of the dataset."]) # a list of 2 sentences
textvectorizer = TextVectorization(max_tokens=None,
            standardize=None,
            ngrams=None,
            output_mode="int",
            output_sequence_length=None,
            pad_to_max_tokens=False)
textvectorizer.adapt(text_list)
vectors = textvectorizer(text_list)
vectors 

Vectors:

array([[ 3,  7,  2,  9,  4],
       [ 5,  6,  8,  2, 10]])

Now, I want to convert the vectors to sentences.

my_vocab = textvectorizer.get_vocabulary()
plain_text_list = []
for line in vectors:
    text = ' '.join(my_vocab[idx] for idx in line)
    plain_text_list.append(text)

print(plain_text_list)

Output:

['this is the first sentence.', 'second line of the dataset.']

This gives the desired result. However, the loop is a naive approach, and it is extremely slow when applied to a large dataset. How can I reduce its execution time?


Solution

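  • Maybe try np.vectorize to apply the vocabulary lookup to the whole array in one call. Note that the result is an array of tokens, not a list of joined sentences: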

    import numpy as np
    
    my_vocab = textvectorizer.get_vocabulary()
    index_vocab =  dict(zip(np.arange(len(my_vocab)), my_vocab))
    print(np.vectorize(index_vocab.get)(vectors))
    
    [['this' 'is' 'the' 'first' 'sentence.']
     ['second' 'line' 'of' 'the' 'dataset.']]
    

    Performance test (5000 runs of each method; the printed values are total seconds, method1 first, then method2):

    import numpy as np
    import timeit
    
    my_vocab = textvectorizer.get_vocabulary()
    
    def method1(my_vocab, vectors):
      index_vocab =  dict(zip(np.arange(len(my_vocab)), my_vocab))
      return np.vectorize(index_vocab.get)(vectors)
    
    def method2(my_vocab, vectors):
      plain_text_list = []
      for line in vectors:
          text = ' '.join(my_vocab[idx] for idx in line)
          plain_text_list.append(text)
      return plain_text_list
    
    t1 = timeit.Timer(lambda: method1(my_vocab, vectors))
    t2 = timeit.Timer(lambda: method2(my_vocab, vectors)) 
    
    print(t1.timeit(5000))
    print(t2.timeit(5000))
    
    0.3139600929998778
    19.671524284000043
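
Another option that may be worth trying is NumPy integer-array indexing, which looks up every token without a Python-level call per element. The following is a minimal sketch rather than a tested benchmark. The vocabulary and vectors are hard-coded stand-ins for `textvectorizer.get_vocabulary()` and `textvectorizer(text_list).numpy()` so the snippet runs without TensorFlow. `inverse_transform` is a hypothetical helper name, and it assumes id 0 maps to the empty padding string, as it does for TextVectorization in "int" output mode.

```python
import numpy as np

# Stand-ins (assumed values) for textvectorizer.get_vocabulary() and
# textvectorizer(text_list).numpy(), hard-coded so this runs without TensorFlow.
my_vocab = ["", "[UNK]", "the", "this", "sentence.", "second",
            "line", "is", "of", "first", "dataset."]
vectors = np.array([[3, 7, 2, 9, 4],
                    [5, 6, 8, 2, 10]])


def inverse_transform(vocab, token_ids):
    """Map a 2-D array of token ids back to a list of sentences."""
    vocab_array = np.asarray(vocab)   # position i holds the token for id i
    tokens = vocab_array[token_ids]   # one indexing operation for the whole batch
    # Skip empty strings, assumed here to be the padding token (id 0).
    return [" ".join(tok for tok in row if tok) for row in tokens]


sentences = inverse_transform(my_vocab, vectors)
print(sentences)
```

With a real layer, the call would look like `inverse_transform(textvectorizer.get_vocabulary(), vectors.numpy())`. Converting the tensor with `.numpy()` first avoids iterating over a TensorFlow tensor element by element. Unlike the np.vectorize version, this returns joined sentences, so time it on your own data before relying on it.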