I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.
My aim is to use these features for a downstream task (not specifically speech recognition). Since the dataset is relatively small, I would train an SVM on these embeddings for the final classification.
So far I have tried this:
from transformers import Wav2Vec2Processor, Wav2Vec2Model

model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

input_values = feature_extractor(
    train_dataset[:10]["speech"],
    return_tensors="pt",
    padding=True,
    sampling_rate=16000,
).input_values
Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_state:
hidden_states = model(input_values).last_hidden_state
or to the sequence of features of the last conv layer of the model:
features_last_cnn_layer = model(input_values).extract_features
Also, is this the correct way to extract features from a pre-trained model?
How can one get embeddings from a specific layer?
P.S.: Posting here as the Hugging Face forum seems to be less active.
Just check the documentation:
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) – Sequence of extracted feature vectors of the last convolutional layer of the model.
The last_hidden_state gives you the contextualized embeddings (i.e., every feature (CNN output) has a vector representation that is to some extent influenced by the other tokens of the sequence).

Also, is this the correct way to extract features from a pre-trained model?

Yes.
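To see the difference between the two outputs side by side, here is a quick check reusing the model and input_values from the question (the commented shapes assume the large architecture, where hidden_size is 1024 and conv_dim[-1] is 512):

import torch

with torch.no_grad():
    o = model(input_values)

print(o.last_hidden_state.shape)   # (batch, frames, 1024): contextualized transformer output
print(o.extract_features.shape)    # (batch, frames, 512): output of the last conv layer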
How can one get embeddings from a specific layer?

Set output_hidden_states=True:
o = model(input_values, output_hidden_states=True)
o.keys()
Output:
odict_keys(['last_hidden_state', 'extract_features', 'hidden_states'])
The hidden_states value is a tuple holding the initial embeddings plus the contextualized embeddings produced by each transformer layer, so you can index it to pick out a specific layer.
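For example, reusing the o from above (the layer index 9 below is only an illustration):

print(len(o.hidden_states))    # num_hidden_layers + 1, e.g. 25 for the 24-layer large model
layer_9 = o.hidden_states[9]   # (batch, frames, hidden_size): output of transformer layer 9
initial = o.hidden_states[0]   # embeddings before the first transformer layer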
P.S.: The jonatasgrosman/wav2vec2-large-xlsr-53-german model was trained with feat_extract_norm=="layer". That means you should also pass an attention mask to the model:
model_name = "jonatasgrosman/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# With feat_extract_norm=="layer", the processor also returns an attention_mask,
# which is forwarded to the model via **i.
i = feature_extractor(
    train_dataset[:10]["speech"],
    return_tensors="pt",
    padding=True,
    sampling_rate=16000,
)
model(**i)
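To close the loop on the SVM you mention, here is a minimal sketch of the downstream step: mean-pool the frame embeddings into one fixed-size vector per clip and fit a scikit-learn SVM. The "label" column is a hypothetical name for whatever target your dataset holds, and plain mean pooling is just one common choice.

import torch
from sklearn.svm import SVC

with torch.no_grad():
    hidden = model(**i).last_hidden_state   # (batch, frames, hidden_size)

# One fixed-size vector per clip; with padded batches you may prefer a masked
# mean so padded frames do not dilute the average.
pooled = hidden.mean(dim=1).numpy()

clf = SVC()
clf.fit(pooled, train_dataset[:10]["label"])   # "label" is a hypothetical column name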