I'm training an LSTM network and I'm looking to understand best practices for training on long sequences, O(1k) timesteps or more. What is a good approach to choosing a minibatch size? How would skew in label prevalence influence that choice? (Positives are rare in my scenario.) Is it worthwhile to make an effort to rebalance my data? Thanks.
You probably want to rebalance the classes so they are roughly 50/50. Otherwise training will skew the model's predictions toward the majority class.
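If it helps, here's a minimal sketch of one way to do that in PyTorch with `WeightedRandomSampler`, which oversamples the minority class so batches come out roughly balanced in expectation (the toy tensors are stand-ins for your real data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy data standing in for your real dataset: 1000 sequences, ~5% positives.
seqs = torch.randn(1000, 1024, 8)            # (examples, timesteps, features)
labels = (torch.rand(1000) < 0.05).long()    # rare positive class
dataset = TensorDataset(seqs, labels)

# Weight each example inversely to its class frequency, so sampling
# yields roughly 50/50 minibatches (minority examples repeat more often).
class_counts = torch.bincount(labels, minlength=2)
sample_weights = (1.0 / class_counts.float())[labels]

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```

An alternative with a similar effect is to keep the data as-is and pass class weights to the loss (e.g. `pos_weight` in `BCEWithLogitsLoss`); rebalancing by sampling just tends to be easier to reason about.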
As for the batch size, I would go as large as will fit in memory.
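A quick way to find that limit is to probe with dummy batches. This is a rough sketch, assuming a CUDA device and an `nn.LSTM`-style model; the helper name and shapes are hypothetical:

```python
import torch

def find_max_batch_size(model, seq_len=1024, n_features=8, start=8,
                        device="cuda"):
    """Double the batch size until a forward/backward pass runs out of memory,
    then report the last size that worked."""
    bs = start
    while True:
        try:
            x = torch.randn(bs * 2, seq_len, n_features, device=device)
            out, _ = model(x)          # nn.LSTM returns (output, (h, c))
            out.sum().backward()       # backward pass dominates memory use
            model.zero_grad()
            bs *= 2
        except RuntimeError as e:      # CUDA OOM surfaces as a RuntimeError
            if "out of memory" not in str(e):
                raise
            torch.cuda.empty_cache()
            return bs
```

In practice you'd leave some headroom below the number this returns, since memory use can vary a bit between steps.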
I am not sure LSTMs will be able to learn dependencies on the order of 1k timesteps, but it is worth a try. You could look into something like WaveNet if you want ultra-long dependencies.
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
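As a rough illustration of the WaveNet idea (the class name and sizes here are hypothetical, not taken from the paper): stacking 1-D causal convolutions with exponentially growing dilation doubles the receptive field at each layer, so ~10 layers with kernel size 2 already cover 2**10 = 1024 timesteps:

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """WaveNet-style stack of dilated causal 1-D convolutions.
    With kernel_size=2 and n_layers=10, the receptive field is 1024 steps."""
    def __init__(self, channels=32, n_layers=10):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2,
                      dilation=2 ** i, padding=2 ** i)
            for i in range(n_layers)
        )

    def forward(self, x):                      # x: (batch, channels, time)
        for i, conv in enumerate(self.layers):
            d = 2 ** i
            # padding=d pads both sides; trimming d steps off the right
            # keeps each output dependent only on current and past inputs.
            x = torch.relu(conv(x)[..., :-d])
        return x
```

Unlike an LSTM, this sees the whole window in one parallel pass rather than stepping through it, which is part of why it handles very long contexts more gracefully.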