I am trying to understand LSTMs for speech recognition. I do understand that, in this setting, the LSTM outputs a phone index (say each unique phone is mapped to a unique integer) for every frame of speech, where each frame is represented by an MFCC feature vector (say 13-dimensional).
There are multiple sources that help in grasping LSTMs for 1D data, like this one.
However, I do not know how to feed in 13-dimensional data. All I could hypothesize is that the weights must take the transposed form of the input, with the biases as 1D scalars. I have never heard of this being done.
The LSTM will simply have 13 input neurons. The inputs enter through the weights (and biases) of the LSTM cell: each gate's input weight matrix has 13 columns, one per MFCC coefficient, and the biases are vectors of the hidden size, not scalars. The whole 13-dimensional feature vector is consumed at each time step.
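Here is a minimal sketch of what this looks like in practice, assuming PyTorch and hypothetical sizes (128 hidden units, 40 phones, 100 frames) chosen only for illustration:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration: 13 MFCC coefficients per frame,
# 128 hidden units, 40 distinct phones (all hypothetical values).
n_mfcc, hidden_size, n_phones = 13, 128, 40

# The LSTM's input weight matrix has shape (4 * hidden_size, n_mfcc),
# i.e. every gate sees all 13 inputs; the biases are vectors, not scalars.
lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden_size, batch_first=True)
classifier = nn.Linear(hidden_size, n_phones)  # one phone score per frame

# One utterance: batch of 1, 100 frames, each a 13-dimensional MFCC vector.
frames = torch.randn(1, 100, n_mfcc)
hidden_states, _ = lstm(frames)               # (1, 100, hidden_size)
phone_logits = classifier(hidden_states)      # (1, 100, n_phones)
phone_indices = phone_logits.argmax(dim=-1)   # predicted phone index per frame
```

Note that nothing special is done to the 13 dimensions: the full vector for each frame is passed in at each time step, and the weight matrices handle the dimensionality.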