Getting embeddings from wav2vec2 models in HuggingFace

I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.

My aim is to use these features for a downstream task (not specifically speech recognition). Namely, since the dataset is relatively small, I would train an SVM with these embeddings for the final classification.

So far I have tried this:

from transformers import Wav2Vec2Model, Wav2Vec2Processor

model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

input_values = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True, 
                                 feature_size=1, sampling_rate=16000 ).input_values 

Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:

hidden_states = model(input_values).last_hidden_state

or to the sequence of features of the last conv layer of the model:

features_last_cnn_layer = model(input_values).extract_features

Also, is this the correct way to extract features from a pre-trained model?

How can one get embeddings from a specific layer?

P.S.: Posting here as the HuggingFace forum seems to be less active.


  • Just check the documentation:

    last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

    extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) – Sequence of extracted feature vectors of the last convolutional layer of the model.

    • The last_hidden_state vector represents so-called contextualized embeddings (i.e., every feature (CNN output) has a vector representation that is to some extent influenced by the other tokens of the sequence).
    • The extract_features vector represents the embeddings of your input (after the CNNs), before any contextualization.
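To make the sequence_length dimension in the shapes above concrete, here is a minimal sketch that computes how many frames the convolutional feature encoder emits for a given number of audio samples. It assumes the model keeps the default Wav2Vec2Config values for conv_kernel and conv_stride (check model.config to confirm); the helper name num_frames is made up for illustration.

```python
# Default wav2vec2 feature-encoder geometry (assumed: Wav2Vec2Config defaults).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Frames emitted by the CNN encoder for a clip of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


one_second = num_frames(16000)  # one second of audio at 16 kHz
print(one_second)
```

Under these assumptions, both last_hidden_state and extract_features have this many positions along their second dimension; they differ only in the size of the last dimension (hidden_size versus conv_dim[-1]).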

    Also, is this the correct way to extract features from a pre-trained model?

    How can one get embeddings from a specific layer?

    Set output_hidden_states=True:

    o = model(input_values, output_hidden_states=True)
    o.keys()

    Output:

    odict_keys(['last_hidden_state', 'extract_features', 'hidden_states'])

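    The hidden_states value is a tuple: its first element is the output of the embeddings (the input to the first transformer layer), and each following element is the contextualized output of one transformer layer, so the last element equals last_hidden_state.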
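As a rough illustration of picking one layer out of hidden_states and turning it into a fixed-size vector per clip (e.g. as SVM input), the sketch below uses a tiny, randomly initialized model so it runs without downloading weights. The small config sizes, the random input, the layer index 2, and mean pooling over time are all assumptions made for the example, not something prescribed by the library; with a real checkpoint you would load the model with from_pretrained instead.

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Tiny random model (illustrative sizes only, not a real checkpoint).
config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=3,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(16, 16, 16, 16, 16, 16, 16),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
)
model = Wav2Vec2Model(config).eval()

# Two fake one-second clips at 16 kHz (stand-in for processor output).
input_values = torch.randn(2, 16000)

with torch.no_grad():
    out = model(input_values, output_hidden_states=True)

layer = 2                              # hypothetical choice of layer
frames = out.hidden_states[layer]      # (batch, frames, hidden_size)
utterance = frames.mean(dim=1)         # (batch, hidden_size), one vector per clip

print(len(out.hidden_states), frames.shape, utterance.shape)
```

Note that a plain mean over time also averages padded frames when clips in a batch have different lengths; in that case the pooling should be restricted to the valid frames of each clip.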

    P.S.: The jonatasgrosman/wav2vec2-large-xlsr-53-german model was trained with feat_extract_norm="layer". That means you should also pass an attention mask to the model:

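    model_name = "facebook/wav2vec2-large-xlsr-53-german"
    feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2Model.from_pretrained(model_name)
    # Keep the whole processor output: it holds input_values and, for this model, attention_mask
    i = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
                          feature_size=1, sampling_rate=16000)
    # Pass the mask so that padded samples are ignored
    o = model(i.input_values, attention_mask=i.attention_mask, output_hidden_states=True)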