I am training a model on the DUC 2004 and Gigaword corpora, and I am using the Keras Tokenizer() as follows:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=num_of_words)
tokenizer.fit_on_texts(list(x_train))

# convert the text sequences into integer sequences
train_seq = tokenizer.texts_to_sequences(x_train)
val_seq = tokenizer.texts_to_sequences(y_val)

# pad with zeros up to the maximum length
train_seq = pad_sequences(train_seq, maxlen=max_summary_len, padding='post')
val_seq = pad_sequences(val_seq, maxlen=max_summary_len, padding='post')
When I try to convert the sequences back to text, words go missing and I get some weird output.
For example:
Actual sentence:
chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them
Sequence to text conversion:
police were wednesday for the bodies of four kidnapped foreigners who were during a to free them
I tried the sequences_to_texts() method of Tokenizer() as well as mapping indices back to words through word_index, roughly like this:
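# roughly the manual mapping I tried; reverse_word_index is my own helper
reverse_word_index = {index: word for word, index in tokenizer.word_index.items()}
decoded = ' '.join(reverse_word_index[i] for i in train_seq[0] if i != 0)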
I am not able to understand why this happens and how to correct it.
Your x_train should be a list of raw texts, where each element of the list is one document (text). Try the code below:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

x_train = ['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
           'I am training a model on DUC2004 and Giga word corpus, for which I am using Tokenizer() from keras as follows']

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(x_train)
train_seq = tokenizer.texts_to_sequences(x_train)
train_seq = pad_sequences(train_seq, maxlen=100, padding='post')
print(tokenizer.sequences_to_texts(train_seq))
Output:
['chechen police were searching wednesday for the bodies of four kidnapped foreigners who were beheaded during a botched attempt to free them',
'i am training a model on duc2004 and giga word corpus for which i am using tokenizer from keras as follows']
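If your full run still drops words, also check your vocabulary cap: Tokenizer(num_words=N) keeps only roughly the N most frequent words, and texts_to_sequences() silently skips everything rarer, which matches the missing words in your example ('chechen', 'searching', 'beheaded', 'botched', 'attempt'). Here is a minimal sketch of the effect, assuming your num_of_words is smaller than your vocabulary, and of oov_token as a way to make the dropped words visible:
from tensorflow.keras.preprocessing.text import Tokenizer

sentence = ['chechen police were searching wednesday for the bodies of four '
            'kidnapped foreigners who were beheaded during a botched attempt '
            'to free them']

# a tight cap: only the most frequent words survive texts_to_sequences()
capped = Tokenizer(num_words=10)
capped.fit_on_texts(sentence)
print(capped.sequences_to_texts(capped.texts_to_sequences(sentence)))
# words ranked outside the cap simply vanish from the round trip

# with an oov_token, dropped words at least leave a visible placeholder
marked = Tokenizer(num_words=10, oov_token='<unk>')
marked.fit_on_texts(sentence)
print(marked.sequences_to_texts(marked.texts_to_sequences(sentence)))
Note that word_index always holds the full vocabulary regardless of num_words, which is why a manual reverse mapping looks fine even though texts_to_sequences() has already filtered the ids. The padding zeros are skipped by sequences_to_texts(), so they never appear in the decoded text.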