Tags: python, nlp, pytorch, bert-language-model, attention-model

XLM/BERT sequence outputs to pooled output with weighted average pooling


Let's say I have a tokenized sentence of length 10, and I pass it to a BERT model.

bert_out = bert(**bert_inp)
hidden_states = bert_out[0]
hidden_states.shape
>>>torch.Size([1, 10, 768])
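
For reference, bert and bert_inp above can be created with the HuggingFace transformers library, for example (the exact model name below is just an illustration):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased')

# The tokenizer returns input_ids, token_type_ids and attention_mask as PyTorch tensors.
bert_inp = tokenizer("An example sentence to encode.", return_tensors='pt')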

This returns a tensor of shape [batch_size, seq_length, d_model], where each token in the sequence is encoded as a 768-dimensional vector.

In TensorFlow, BERT also returns a so-called pooled output, which corresponds to a vector representation of the whole sentence.
I want to obtain it by taking a weighted average of the sequence vectors, and the way I do it is:

hidden_states.view(-1, 10).shape
>>> torch.Size([768, 10])

pooled = nn.Linear(10, 1)(hidden_states.view(-1, 10))
pooled.shape
>>> torch.Size([768, 1])
  • Is this the right way to proceed, or should I just flatten the whole thing and then apply a linear layer?
  • Any other ways to obtain a good sentence representation?

Solution

  • There are two simple ways to get a sentence representation:

    • Get the vector for the CLS token.
    • Get the pooler_output

    Assuming hidden_states has shape [batch_size, seq_length, d_model], where batch_size is the number of sentences, you can get the CLS token for every sentence like this:

    bert_out = bert(**bert_inp)
    hidden_states = bert_out['last_hidden_state']
    cls_tokens = hidden_states[:, 0, :]  # 0 for the CLS token for every sentence.
    

    You will have a tensor with shape (batch_size, d_model).

    To get the pooler_output:

    bert_out = bert(**bert_inp)
    pooler_output = bert_out['pooler_output']
    

    Again you get a tensor with shape (batch_size, d_model).
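
    For reference, in the HuggingFace implementation pooler_output is just the CLS vector passed through an additional dense layer with a tanh activation, so the two options above are closely related.

    If you specifically want the learned weighted average from the question, one way to keep the dimensions straight is to transpose the hidden states so the linear layer acts over the sequence dimension instead of mixing tokens and features. A minimal sketch, assuming the fixed sequence length of 10 from the question:

    import torch
    import torch.nn as nn

    hidden_states = torch.randn(1, 10, 768)  # [batch_size, seq_length, d_model]

    seq_len = hidden_states.size(1)
    weight_pool = nn.Linear(seq_len, 1)  # learns one weight per token position

    # Transpose to [batch_size, d_model, seq_length] so the linear layer
    # combines the 10 token positions for each of the 768 features.
    pooled = weight_pool(hidden_states.transpose(1, 2)).squeeze(-1)
    pooled.shape
    >>> torch.Size([1, 768])

    Note that this ties the pooling layer to a fixed sequence length; a mask-aware mean over the token vectors is another common way to get a sentence representation.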