python, tensorflow, deep-learning, huggingface-transformers

Adding Special Tokens Changes all Embeddings - TF Bert Hugging Face


Given the following,

from transformers import TFAutoModel
from transformers import BertTokenizer


bert = TFAutoModel.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

I expected that if special tokens are added, the embeddings of the remaining tokens would stay the same, yet they do not. For example, I expected the following two outputs to be equal, but every token's embedding changes. Why is this?

tokens = tokenizer(['this product is no good'], add_special_tokens=True,return_tensors='tf')
output = bert(tokens)

output[0][0][1]

(screenshot of the embedding values for output[0][0][1])

tokens = tokenizer(['this product is no good'], add_special_tokens=False,return_tensors='tf')
output = bert(tokens)

output[0][0][0]

(screenshot of the embedding values for output[0][0][0])


Solution

  • When setting add_special_tokens=True, you are including the [CLS] token at the front and the [SEP] token at the end of your sentence, which leads to a total of 7 tokens instead of 5:

    import tensorflow as tf

    tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
    print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
    
    ['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]']
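
    For comparison, the same call with add_special_tokens=False (a quick check, reusing the tokenizer from above) yields only the five word-piece tokens:

    tokens = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')
    print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))

    ['this', 'product', 'is', 'no', 'good']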
    

    Your sentence-level embeddings are different because these two special tokens become part of the input that every other token attends to as the sequence is propagated through the BERT model. They are not masked out the way padding tokens [PAD] are. Check out the docs for more information.
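
    A minimal check makes that difference visible (reusing the tokenizer from above; padding to a length of 10 is chosen here purely for illustration): padding positions get a 0 in the attention_mask, while [CLS] and [SEP] get a 1 and therefore participate in attention.

    # Pad the sentence so padding and special tokens can be compared side by side.
    tokens = tokenizer(['this product is no good'], add_special_tokens=True,
                       padding='max_length', max_length=10, return_tensors='tf')

    print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
    # ['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]', '[PAD]', '[PAD]', '[PAD]']

    print(tokens['attention_mask'].numpy())
    # [[1 1 1 1 1 1 1 0 0 0]] -> [CLS] and [SEP] are attended to (1); only the [PAD] positions are masked (0)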

    If you take a closer look at how BERT's Transformer-encoder architecture and attention mechanism work, you will quickly see why a single difference between two input sequences generates different hidden_states. New tokens are not simply concatenated to the existing ones; through self-attention, every token's representation depends on all the others. According to the BERT author Jacob Devlin:

    I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.

    Or another interesting discussion:

    [...] The value of CLS is influenced by other tokens, just like other tokens are influenced by their context (attention).
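
    To make the effect concrete, here is a short sketch under the question's setup (the helper name hidden_state_of_this is purely illustrative) that compares the hidden state of the word 'this' with and without the special tokens:

    import numpy as np
    import tensorflow as tf
    from transformers import TFAutoModel, BertTokenizer

    bert = TFAutoModel.from_pretrained('bert-base-cased')
    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    def hidden_state_of_this(add_special_tokens):
        tokens = tokenizer(['this product is no good'],
                           add_special_tokens=add_special_tokens, return_tensors='tf')
        last_hidden = bert(tokens)[0][0]  # (sequence_length, 768) for the single sentence
        # 'this' sits at index 1 when [CLS] is prepended, at index 0 otherwise
        return last_hidden[1 if add_special_tokens else 0]

    diff = np.abs(hidden_state_of_this(True).numpy() - hidden_state_of_this(False).numpy())
    print(diff.max())  # clearly non-zero: [CLS] and [SEP] change every token's hidden state via attention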