I have been following the tutorial for feature extraction using pytorch audio here: https://pytorch.org/audio/0.10.0/pipelines.html#wav2vec-2-0-hubert-representation-learning
It says the result is a list of 12 tensors, where each entry is the output of a transformer layer. So the first tensor in the list has a shape like (1, 2341, 768).
This seems correct, as I get this result for most audio files.
However, for some videos I get back a list of length 12 as expected, but each entry bizarrely has a batch size greater than 1, i.e. a shape of (2, 2341, 768).
I am baffled as to why this happens. Any clues would be great.
This is likely coming from your incoming audio being multi-channel (stereo, for example). Check the shape of your input tensor to see if the input is "batched" too: a stereo file loads as shape (2, L), with L being the number of samples, so the model treats the two channels as a batch of two. Each transformer layer then gives you a representation of shape (2, L', D), where L' is the length of the output sequence and D is the model's feature dimension.
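A minimal sketch of the fix, assuming a stereo waveform tensor of shape (channels, samples): downmix to mono by averaging the channels before feeding it to the model, keeping the leading dimension so the model sees a batch of one.

```python
import torch

# Hypothetical stereo waveform: shape (channels, samples) = (2, 16000).
# A tensor like this looks "batched" to the model, producing (2, L', D) outputs.
waveform = torch.randn(2, 16000)

# Downmix to mono by averaging the two channels; keepdim=True preserves
# the leading dimension, so the result is a batch of one: (1, 16000).
mono = waveform.mean(dim=0, keepdim=True)

print(waveform.shape)  # torch.Size([2, 16000])
print(mono.shape)      # torch.Size([1, 16000])
```

With the mono input, each layer's output should come back with batch size 1, e.g. (1, 2341, 768), matching the tutorial.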