Tags: python, tensorflow, keras, nlp, tokenize

Keras Tokenizer sequence to text changes word order


I am training a model on the DUC2004 and Gigaword corpora, for which I am using Tokenizer() from Keras as follows:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_of_words)
tokenizer.fit_on_texts(list(x_train))

# convert text sequences into integer sequences
train_seq = tokenizer.texts_to_sequences(x_train)
val_seq = tokenizer.texts_to_sequences(y_val)

# pad with zeros up to the maximum length
train_seq = pad_sequences(train_seq, maxlen=max_summary_len, padding='post')
val_seq = pad_sequences(val_seq, maxlen=max_summary_len, padding='post')

When I try to convert the sequences back to text, the word order seems to change and I get some strange output.

For example:

Actual sentence:

chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them

Sequence to text conversion:

police were wednesday for the bodies of four kidnapped foreigners who were during a to free them

I tried using the sequences_to_texts() method of Tokenizer() as well as mapping the words manually via word_index.

I cannot understand why this happens or how to correct it.
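
Here is a minimal, self-contained sketch of what I am seeing, assuming the vocabulary limit (num_words) is what makes the words disappear:

from tensorflow.keras.preprocessing.text import Tokenizer

sentence = ['chechen police were searching wednesday for the bodies of four '
            'kidnapped foreigners who were beheaded during a botched attempt '
            'to free them']

# With a deliberately small num_words, only the most frequent words
# survive texts_to_sequences; the rest vanish on the way back.
tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(sentence)

seq = tokenizer.texts_to_sequences(sentence)
print(tokenizer.sequences_to_texts(seq))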


Solution

  • Your x_train should be a list of raw texts, where each element of the list corresponds to one document. Try the code below:

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    x_train = ['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
               'I am training a model on DUC2004 and Giga word corpus, for which I am using Tokenizer() from keras as follows']

    tokenizer = Tokenizer(1000)
    tokenizer.fit_on_texts(x_train)

    train_seq = tokenizer.texts_to_sequences(x_train)
    train_seq = pad_sequences(train_seq, maxlen=100, padding='post')

    tokenizer.sequences_to_texts(train_seq)
    

    Output:

    ['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
     'i am training a model on duc2004 and giga word corpus for which i am using tokenizer from keras as follows']
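
    As an aside: if words are still dropped because they rank outside the num_words limit, you can pass an oov_token when building the Tokenizer so they come back as a placeholder instead of disappearing silently. A minimal sketch (the '<unk>' name is just an example):

    from tensorflow.keras.preprocessing.text import Tokenizer

    # Any word ranked outside the top num_words is mapped to the
    # placeholder index instead of being removed from the sequence.
    tokenizer = Tokenizer(num_words=1000, oov_token='<unk>')
    tokenizer.fit_on_texts(x_train)

    train_seq = tokenizer.texts_to_sequences(x_train)
    tokenizer.sequences_to_texts(train_seq)
    # rare words now come back as '<unk>', so the word order stays visible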