I have two lists of characters: one contains piano notes and the other string notes. The idea is to train a model to predict string notes from piano notes, so that it generates a string melody to accompany a piano melody. To make the output more fluent, the model should take into account not only the current piano note but also the previous ones.
I have created a dataset of more than 100 songs (and I am still adding new ones). At the moment the total note count for the piano and string lists is 48523. The vocabulary size is 447 for piano notes and 261 for string notes:
len(set(piano_notes)) #447
len(set(string_notes)) #261
All notes are one-hot encoded, and the sequence length is 100. The shapes of both arrays:
print(x.shape) #(48523, 100, 447)
print(y.shape) #(48523, 100, 261)
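Roughly, the encoding step looks like this (a simplified sketch with random stand-in data; `piano_windows` and `string_windows` are placeholder names for the integer-encoded note windows that in practice come from the songs):

import numpy as np
from tensorflow.keras.utils import to_categorical

# Stand-in integer-encoded windows; the real dataset has 48523 windows of 100 notes
piano_windows = np.random.randint(0, 447, size=(32, 100))
string_windows = np.random.randint(0, 261, size=(32, 100))

# to_categorical appends a one-hot axis: (n_windows, 100) -> (n_windows, 100, vocab)
x = to_categorical(piano_windows, num_classes=447)
y = to_categorical(string_windows, num_classes=261)

print(x.shape)  # (32, 100, 447); with the full dataset: (48523, 100, 447)
print(y.shape)  # (32, 100, 261); with the full dataset: (48523, 100, 261)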
It is unclear to me what shape the y data should take. The network looks like this:
def create_network(x, n_vocab_string_notes):
    """ create the structure of the neural network """
    model = Sequential()
    model.add(LSTM(
        512,
        input_shape=(x.shape[1], x.shape[2]),
        return_sequences=True
    ))
    model.add(Dropout(0.3))
    model.add(LSTM(512, return_sequences=True))
    model.add(Dropout(0.3))
    model.add(LSTM(512))
    model.add(Dense(256))
    model.add(Dropout(0.3))
    model.add(Dense(n_vocab_string_notes))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    return model
And I train it like this:
def train(model, x, y):
    """ train the neural network """
    file_path = "weights-improved.hdf5"
    checkpoint = ModelCheckpoint(
        file_path,
        monitor='loss',
        verbose=0,
        save_best_only=True,
        mode='min'
    )
    callbacks_list = [checkpoint]
    model.fit(x, y, epochs=200, batch_size=64, callbacks=callbacks_list)
Now it throws an error because the y shape is not what the model expects: it says it got an array with shape (48523, 100, 261) instead of the expected 2 dimensions.
The goal is to predict string notes based on piano notes: for a piano sequence of, say, 100 notes, predict the corresponding string notes of the same length. That way I could predict a string note list from a single piano note list, meaning a corresponding string melody can be generated for every piano melody.
You are currently compressing the time dimension by not returning sequences in the last LSTM(512). You need to return a sequence there as well and apply the upper layers to every timestep. Something along the lines of:
# last LSTM, now returning the full sequence
model.add(LSTM(512, return_sequences=True))
model.add(TimeDistributed(Dense(256)))
model.add(Dropout(0.3))
model.add(TimeDistributed(Dense(n_vocab_string_notes, activation='softmax')))
Now the output will also be a sequence, so it matches your y of shape (48523, 100, 261).
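Putting it together, the whole network could look like this (a sketch assuming tf.keras; adjust the imports if you use standalone Keras, and note that TimeDistributed needs to be imported as well):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, TimeDistributed

def create_network(x, n_vocab_string_notes):
    """ create the structure of the neural network (sequence-to-sequence) """
    model = Sequential()
    model.add(LSTM(
        512,
        input_shape=(x.shape[1], x.shape[2]),
        return_sequences=True
    ))
    model.add(Dropout(0.3))
    model.add(LSTM(512, return_sequences=True))
    model.add(Dropout(0.3))
    # return_sequences=True keeps the time dimension through the last LSTM
    model.add(LSTM(512, return_sequences=True))
    model.add(TimeDistributed(Dense(256)))
    model.add(Dropout(0.3))
    # softmax over the string-note vocabulary at every timestep
    model.add(TimeDistributed(Dense(n_vocab_string_notes, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    return model

With this, model.fit(x, y, ...) accepts y of shape (48523, 100, 261) unchanged, and model.predict on a batch of piano sequences returns per-timestep probability distributions of shape (batch, 100, 261); taking np.argmax over the last axis gives the predicted string note at each step.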