python arrays numpy tokenize recurrent-neural-network

Trying to slice an array results in "Too many indices for array". Can I pad the array to fix this?

I've seen the multitude of questions about this particular error. I believe my question is different enough to warrant its own post.

My objective: I am building an RNN that generates news headlines. It will predict the next word based on the words that came before it. This code is from an example and I am trying to adapt it to work for my situation. I am trying to slice the array into an X and y.

The issue: I understand that the error appears because the array is being indexed as if it were a 2D array, but it is actually a 1D array. Before I convert sequences to an array, it is a list of lists, but not all of the nested lists are the same length, so NumPy converts it to a 1D array of list objects.

My question(s): Is there a simple or elegant way to pad sequences so that all of the lists are the same length? Can I do this using spaces to keep the same meaning in the shorter headlines? Why do I need to change the list of lists to an array at all? As I said before, this is from an example and I am trying to understand what they did and why.

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Pretreat Data Section
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# integer encode sequences of words
# create the tokenizer 
t = Tokenizer() 
# fit the tokenizer on the headlines 
t.fit_on_texts(headlines)
sequences = t.texts_to_sequences(headlines)

# vocabulary size
vocab_size = len(t.word_index) + 1

#separate into input and output
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]     # fix this
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-87-eb7aab0c3a22> in <module>
     18 #separate into input and output
     19 sequences = np.array(sequences)
---> 20 X, y = sequences[:,:-1], sequences[:,-1]     # fix this
     21 y = to_categorical(y, num_classes=vocab_size)
     22 seq_length = X.shape[1]

IndexError: too many indices for array

Solution

The problem is that this tutorial has several parts on one page, and each part has its own "Complete Example".

The first "Complete Example" reads text from republic_clean.txt, cleans it, and saves it to republic_sequences.txt. It creates sequences that all have the same number of words.

The second "Complete Example" reads text from republic_sequences.txt and uses it with

sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]

Because the first part creates sequences with the same number of words, this code works correctly.

It seems you skipped the first part. You have to go back to it to learn how to clean the text and how to create the file that the second part expects.
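Building sequences with the same number of words can be sketched as a sliding window over a list of tokens. This is only an illustration, not the tutorial's code: the `tokens` list and the `seq_len` value are made up for the example.

```python
# Hypothetical token list standing in for the cleaned text.
tokens = "the quick brown fox jumps over the lazy dog".split()

# Assumed window size: 3 input words plus 1 target word per sequence.
seq_len = 4

# Slide a window of seq_len words over the tokens, one word at a time,
# so every resulting sequence has exactly seq_len words.
sequences = [tokens[i - seq_len:i] for i in range(seq_len, len(tokens) + 1)]

for seq in sequences:
    print(" ".join(seq))
```

Because every window has the same length, converting such sequences to a NumPy array gives a proper 2D array, which is what the `[:, :-1]` slicing needs.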

EDIT: If you can't make sequences with the same number of words, you can pad the shorter sequences with empty strings. The code will then work, but I don't know whether it will produce a better model.

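# example sequences of unequal length
sequences = [['a'], ['b','c'], ['d','e','f']]

# length of the longest sequence
max_len = max(map(len, sequences))

# pad every shorter sequence on the right with empty strings
sequences = [x + [""]*(max_len-len(x)) for x in sequences]

print(sequences)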

Result

[['a', '', ''], ['b', 'c', ''], ['d', 'e', 'f']]
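The sequences in the question are integer-encoded, so the same idea may be easier to apply with 0 as the padding value, since the Keras Tokenizer never assigns index 0 to a word (which is why `vocab_size` adds 1). Below is a minimal sketch, assuming made-up integer sequences in place of the output of `t.texts_to_sequences(headlines)`. It pads on the left so that the last column, which becomes `y`, always holds a real word.

```python
import numpy as np

# Hypothetical integer-encoded headlines of unequal length
# (stand-ins for the output of t.texts_to_sequences(headlines)).
sequences = [[4, 7], [2, 9, 5, 1], [3, 8, 6]]

max_len = max(len(seq) for seq in sequences)

# Left-pad each sequence with 0 so all rows have max_len entries
# and the final entry of every row is still a real word index.
padded = np.array([[0] * (max_len - len(seq)) + seq for seq in sequences])

# All rows now have the same length, so 2D slicing works.
X, y = padded[:, :-1], padded[:, -1]

print(padded.shape)
print(X)
print(y)
```

Keras also provides a `pad_sequences` helper that pads on the left with zeros by default, which should give the same result. Whether a model trained on padded headlines is any good is a separate question that this sketch does not address.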