Let's say I have a tokenized sentence of length 10, and I pass it to a BERT model.
bert_out = bert(**bert_inp)
hidden_states = bert_out[0]
hidden_states.shape
>>> torch.Size([1, 10, 768])
This returns a tensor of shape [batch_size, seq_length, d_model], where each token in the sequence is encoded as a 768-dimensional vector.
In TensorFlow, BERT also returns a so-called pooled output, which corresponds to a vector representation of the whole sentence.
I want to obtain it by taking a weighted average of the sequence vectors, and the way I do it is:
hidden_states.view(-1, 10).shape
>>> torch.Size([768, 10])
pooled = nn.Linear(10, 1)(hidden_states.view(-1, 10))
pooled.shape
>>> torch.Size([768, 1])
There are two simple ways to get a sentence representation:
- the CLS token
- pooler_output
Assuming the input is [batch_size, seq_length, d_model], where batch_size is the number of sentences, then to get the CLS token for every sentence:
bert_out = bert(**bert_inp)
hidden_states = bert_out['last_hidden_state']
cls_tokens = hidden_states[:, 0, :]  # index 0 along seq_length is the CLS token for every sentence
You will have a tensor with shape (batch_size, d_model).
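To make the shapes concrete, here is a self-contained sketch with a random tensor standing in for the model's last hidden state (no model download needed; the sizes are illustrative):

```python
import torch

batch_size, seq_length, d_model = 4, 10, 768

# Stand-in for bert_out['last_hidden_state']
last_hidden_state = torch.randn(batch_size, seq_length, d_model)

# Position 0 along the sequence axis holds the CLS token.
cls_tokens = last_hidden_state[:, 0, :]
print(cls_tokens.shape)  # torch.Size([4, 768])
```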
To get the pooler_output:
bert_out = bert(**bert_inp)
pooler_output = bert_out['pooler_output']
Again, you get a tensor with shape (batch_size, d_model).
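For reference, the pooler_output is not simply the raw CLS vector: in the Hugging Face implementation, the CLS hidden state is passed through an extra dense layer with a tanh activation. Roughly, with a dummy tensor in place of the real model output:

```python
import torch
import torch.nn as nn

d_model = 768

# Stand-in for bert_out['last_hidden_state']: [batch_size, seq_length, d_model]
last_hidden_state = torch.randn(2, 10, d_model)

# Rough sketch of the BertPooler module: dense layer + tanh on the CLS state.
# (The real layer's weights are learned during pre-training.)
dense = nn.Linear(d_model, d_model)
pooler_output = torch.tanh(dense(last_hidden_state[:, 0, :]))
print(pooler_output.shape)  # torch.Size([2, 768])
```

This is why cls_tokens and pooler_output have the same shape but different values.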