In the paper CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning I recently read, it specified that the 3D volume output from a CNN layer must be reduced into a 2 dimensional sequence before entering the LSTM layer, why is that? What's wrong with using the 3 dimensional format?
The standard LSTM neural network assumes input of the following size:
[batch size] × [sequence length] × [feature dim]
The LSTM first multiplies each vector of size [feature dim]
by a matrix, and then combines them in a fancy way. What's important here is that there's a vector per each example (the batch dimensions) and each timestep (the seq. length dimension). In a sense, this vector is first transformed by a matrix multiplication(s) (possibly involving some pointwise non-linearities, which don't change the shape, so I don't mention them) into a hidden state update, which is also a vector, and the updated hidden state vector is then used to produce the output (also a vector).
As you can see, the LSTM is designed to operate on vectors. You could design a Matrix-LSTM – an LSTM counterpart that assumes any or all of the following are matrices: the input, the hidden state, the output. That would require you to replace matrix-vector multiplications that process the input (or the state) by a generatlized linear operation that is able to turn any matrix into any other, which would be given by a rank-4 tensor, I believe. However, it'd be equivalent to just reshaping the input matrix into a vector, reshaping the rank-4 tensor into a matrix, doing matrix-vector product and then reshaping the output back into a matrix, so it makes little sense to devise such Matrix-LSTMs instead of just reshaping your inputs.
That said, it might still make sense to design a generalized LSTM that takes something other than a vector as input if the you know something about the input structure that instructs a more specific linear operator than a general rank-4 tensor. For example, images are known to have local structure (nearby pixels are more related than those far apart), hence using convolutions is more "reasonable" than reshaping images to vectors and then performing a general matrix multiplication. In a similar fashion you could replace all the matrix-vector multiplications in the LSTM with convolutions, which would allow for image-like input, states and outputs.