So I've used RNN/LSTMs in three different capacities:
In none of these cases do I use the intermediate layers' hidden states to generate my final output: only the last layer's outputs in case #1, and only the last layer's final hidden state in cases #2 and #3. However, PyTorch's nn.LSTM/nn.RNN returns a tensor containing the final hidden state of every layer, so I assume those states have some use.
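For concreteness, here is a minimal sketch of the shapes I am talking about (the sizes are made up; assuming a unidirectional LSTM with the default batch_first=False):

```python
import torch
import torch.nn as nn

# Made-up sizes, just to show the shapes.
seq_len, batch, input_size, hidden_size, num_layers = 7, 4, 10, 16, 3

lstm = nn.LSTM(input_size, hidden_size, num_layers)
x = torch.randn(seq_len, batch, input_size)

output, (h_n, c_n) = lstm(x)

# output: the last layer's hidden state at every time step
print(output.shape)  # (seq_len, batch, hidden_size) -> [7, 4, 16]

# h_n: the final hidden state of *every* layer -- this is the part I'm asking about
print(h_n.shape)     # (num_layers, batch, hidden_size) -> [3, 4, 16]
```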
What are some use cases for those intermediate layers' states?
There's nothing explicitly requiring you to use the last layer only. You could feed all of the layers into your final classifier MLP at each position in the sequence (or just their final states at the end, if you're classifying the whole sequence); see the sketch below.
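For example, a minimal sketch of that idea for whole-sequence classification (the sizes and the MLP here are placeholders I made up, not anything canonical):

```python
import torch
import torch.nn as nn

# Made-up sizes; the point is only the wiring of all layers' final states into one MLP.
seq_len, batch, input_size, hidden_size, num_layers, num_classes = 7, 4, 10, 16, 3, 5

lstm = nn.LSTM(input_size, hidden_size, num_layers)
classifier = nn.Sequential(
    nn.Linear(num_layers * hidden_size, 64),
    nn.ReLU(),
    nn.Linear(64, num_classes),
)

x = torch.randn(seq_len, batch, input_size)
_, (h_n, _) = lstm(x)                                # h_n: (num_layers, batch, hidden_size)

# Use every layer's final hidden state instead of only h_n[-1]
features = h_n.permute(1, 0, 2).reshape(batch, -1)   # (batch, num_layers * hidden_size)
logits = classifier(features)                        # (batch, num_classes)
```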
As a practical example, consider the ELMo architecture for generating contextualized (that is, token-level) word embeddings (paper: https://www.aclweb.org/anthology/N18-1202/). The representations are the hidden states of a multi-layer bidirectional LSTM language model, and the embedding handed to a downstream task is a task-specific, softmax-normalized weighted sum over all of the layers, so every layer contributes. Figure 2 in the paper shows how the usefulness of the different layers varies by task; the authors suggest that lower layers capture more syntactic information, while higher layers capture more semantic, context-dependent information.
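Roughly, that layer combination looks like the following sketch (a paraphrase of the idea, not the authors' code; the ScalarMix name and all sizes are my own assumptions):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style combination: softmax-weighted sum of per-layer states,
    scaled by a learned scalar (a sketch, not the authors' implementation)."""
    def __init__(self, num_layers):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states):
        # layer_states: (num_layers, seq_len, batch, hidden_size)
        w = torch.softmax(self.layer_weights, dim=0)           # one weight per layer
        mixed = (w.view(-1, 1, 1, 1) * layer_states).sum(dim=0)
        return self.gamma * mixed                              # (seq_len, batch, hidden_size)

# Made-up usage: stack the hidden states of 3 layers along dim 0
states = torch.randn(3, 7, 4, 16)
token_embeddings = ScalarMix(num_layers=3)(states)  # one contextual vector per token
```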