Tags: nlp, pytorch, huggingface-transformers, bert-language-model

Confusion in understanding the output of the BertForTokenClassification class from the Transformers library


This is the example given in the documentation of the Transformers PyTorch library:

from transformers import BertTokenizer, BertForTokenClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased',
                                                   output_hidden_states=True,
                                                   output_attentions=True)

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", 
                         add_special_tokens=True)).unsqueeze(0)  # Batch size 1
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids, labels=labels)

loss, scores, hidden_states, attentions = outputs

Here, hidden_states is a tuple of length 13 containing the hidden states of the model at the output of each layer, plus the initial embedding outputs. I would like to know whether hidden_states[0] or hidden_states[12] represents the final hidden state vectors.
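
A quick way to see this structure is to print the tuple's length and the tensor shapes (a minimal sketch continuing the snippet above; the shapes assume bert-base-uncased, whose hidden size is 768, and the 8-token encoded input from the example):

print(len(hidden_states))      # 13: embedding output + 12 encoder layers
for h in hidden_states:
    print(h.shape)             # each is torch.Size([1, 8, 768]): (batch, seq_len, hidden)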


Solution

  • If you check the source code, specifically BertEncoder, you can see that the returned hidden states are initialized as an empty tuple, and that each layer's output is appended to it on every iteration of the loop over the layers.

    The output of the final layer is appended as the last element after this loop (see here), so we can safely assume that hidden_states[12] holds the final hidden state vectors.
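
    You can also verify this empirically. Here is a minimal sketch, assuming the same tuple-returning Transformers API as in the question: BertModel returns the final encoder output directly as its first element, so it should be identical to hidden_states[12], while hidden_states[0] is the embedding output before any encoder layer.

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    # output_hidden_states=True makes the model also return the per-layer states
    bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
    bert.eval()

    input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute",
                                              add_special_tokens=True)).unsqueeze(0)
    with torch.no_grad():
        # Tuple output: (last_hidden_state, pooler_output, hidden_states)
        last_hidden_state, pooler_output, hidden_states = bert(input_ids)

    print(torch.allclose(last_hidden_state, hidden_states[12]))  # True  -> final layer
    print(torch.allclose(last_hidden_state, hidden_states[0]))   # False -> embeddings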