
How to convert keras tokenizer.texts_to_matrix (one-hot encoded matrix) of words back to text


I referred to this post, which discusses how to get text back from the output of the Keras tokenizer's texts_to_sequences function using the reverse_map strategy.

Is there a function that does the same for the output of texts_to_matrix?
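For context, the reverse_map idea can be sketched as follows. This is only an illustration: the `word_index` dict is hard-coded as a stand-in for `t.word_index`, and `sequence_to_text` is a hypothetical helper name, not part of Keras.

```python
# Stand-in for tokenizer.word_index (word -> index); hard-coded for illustration.
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}

# Invert the mapping so that each index points back to its word.
reverse_map = {index: word for word, index in word_index.items()}

def sequence_to_text(sequence):
    """Turn a list of word indices back into a space-separated string."""
    return ' '.join(reverse_map[i] for i in sequence)

print(sequence_to_text([2, 3]))  # well done
```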

Example:

from tensorflow.keras.preprocessing.text import Tokenizer

docs = ['Well done!',
    'Good work',
    'Great effort',
    'nice work',
    'Excellent!']

# create the tokenizer
t = Tokenizer()

# fit the tokenizer on the documents
t.fit_on_texts(docs)
print(t)
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)
print(t.word_index.items())

Output: 
<keras_preprocessing.text.Tokenizer object at 0x7f746b6594e0>
[[0. 0. 1. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1.]]
dict_items([('work', 1), ('well', 2), ('done', 3), ('good', 4), ('great', 5), ('effort', 6), ('nice', 7), ('excellent', 8)])

How can I get docs back from the one-hot matrix?


Solution

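  • For a one-hot matrix that is predicted by a model rather than given (so its values need not be exactly 0 or 1), I came up with the following solution:

    import pandas as pd

    def onehot_to_text(mat, tokenizer, cutoff):
        # Label each column with the word it stands for (column 0 has no word).
        mat = pd.DataFrame(mat).rename(columns=tokenizer.index_word)
        output = []
        for row in range(mat.shape[0]):
            if mat.iloc[row].sum() == 0:
                # An all-zero row contains no words.
                output.append([])
            else:
                # Keep the words whose value reaches the cutoff.
                output.append(mat.columns[mat.iloc[row] >= cutoff].tolist())
        return pd.Series(output)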
    

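    onehot_to_text(encoded_docs, t, 0.5) returns, for each row, the list of words whose value is at least the cutoff. The original word order cannot be recovered, because the matrix only records which words occur in each document.

    The function also handles all-zero rows, for which it returns an empty list.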
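    If pandas is not needed, the same idea can be sketched in plain Python. This is a minimal sketch, not a Keras function: `matrix_to_words` is a hypothetical helper name, and `index_word` is hard-coded here as a stand-in for `t.index_word`, using the word_index printed in the question.

```python
# Stand-in for tokenizer.index_word (index -> word), copied from the question's output.
index_word = {1: 'work', 2: 'well', 3: 'done', 4: 'good',
              5: 'great', 6: 'effort', 7: 'nice', 8: 'excellent'}

def matrix_to_words(matrix, index_word, cutoff=0.5):
    """Return, for each row, the words whose column value is >= cutoff."""
    docs = []
    for row in matrix:
        # Column 0 has no entry in index_word, so it is skipped.
        words = [index_word[i] for i, value in enumerate(row)
                 if i in index_word and value >= cutoff]
        docs.append(words)
    return docs

encoded = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # 'Well done!'
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # 'Good work'
    [0, 0, 0, 0, 0, 0, 0, 0, 0],  # all-zero row
]
print(matrix_to_words(encoded, index_word))
# [['well', 'done'], ['work', 'good'], []]
```

    Words come back in column (index) order, so 'Good work' is returned as ['work', 'good'].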