python-3.x data-structures dataset keras reuters

Reconstruct news texts from Keras' reuters dataset


I can't seem to make sense of the Reuters dataset provided by Keras.

The set is loaded like so:

from keras.datasets import reuters
(x_train, y_train), (x_test, y_test) = reuters.load_data()

As far as I understand the "x" arrays are arrays of sequences (lists) of word indices from news stories and the "y" arrays are arrays of the topics of these sequences.

But when I try to translate the word indices of one of the sequences with the provided dictionary into actual words:

wordDict = {y: x for x, y in reuters.get_word_index().items()}
for index in x_train[0]:
    print(wordDict.get(index))

The sequence seems to make no sense. How do I turn the sequences back into the original news?

Edit: I found a similar thread here. It seems the indices in the dictionary do not match the word indices in the dataset, but re-downloading the data does not resolve the problem for me.


Solution

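  • By default, load_data uses index_from=3, which shifts every actual word index up by 3 so that the lowest indices stay reserved for special markers (with the defaults, 1 marks the start of a sequence and 2 an out-of-vocabulary word; 0 is conventionally used for padding). Actual words therefore have indices greater than 3, and the texts can be reconstructed with wordDict.get(index - 3).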
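
The offset fix above can be sketched as a small helper. This is only an illustration: decode_sequence, toy_word_index and toy_sequence are hypothetical names and values, not part of Keras. The toy dictionary stands in for the word-to-index mapping returned by reuters.get_word_index(), so the snippet runs without downloading anything.

```python
def decode_sequence(sequence, word_index, index_from=3):
    """Turn a Keras-encoded list of integers back into a string of words."""
    # Invert the word -> index mapping (same idea as wordDict in the question).
    index_to_word = {idx: word for word, idx in word_index.items()}
    # Undo the index_from offset; reserved markers (start, unknown, padding)
    # have no entry in the dictionary and are shown as "?".
    words = [index_to_word.get(i - index_from, "?") for i in sequence]
    return " ".join(words)


# Toy stand-in for reuters.get_word_index() (hypothetical values).
toy_word_index = {"the": 1, "of": 2, "oil": 3}
# 1 is the start marker; 4, 6, 5 are "the", "oil", "of" shifted up by 3.
toy_sequence = [1, 4, 6, 5]

decoded = decode_sequence(toy_sequence, toy_word_index)
print(decoded)  # ? the oil of
```

With the real data, the equivalent call would be decode_sequence(x_train[0], reuters.get_word_index()). If load_data was called with a non-default index_from, the same value should be passed to the helper.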