huggingface-transformers · bert-language-model · word-embedding

BERT without positional embeddings


I am trying to build a pipeline in HuggingFace that does not use the positional embeddings in BERT, in order to study their role for a particular use case. I have looked through the documentation and the code, but I have not been able to find a way to implement a model like that. Will I need to modify the BERT source code, or is there a configuration I can fiddle around with?


Solution

  • You can use a workaround and set the position embedding weights to zeros. When you inspect the embeddings part of BERT, you can see that the position embeddings are there as a separate PyTorch module:

    from transformers import AutoModel
    bert = AutoModel.from_pretrained("bert-base-cased")
    print(bert.embeddings)
    
    BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    

    You can assign the position embedding parameters whatever value you want, including zeros, which will effectively disable the position embeddings:

    import torch

    bert.embeddings.position_embeddings.weight.data = torch.zeros((512, 768))
    

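    As a quick sanity check (a minimal sketch; the token id below is arbitrary), you can confirm that the embedding layer's output no longer depends on the position once the weights are zeroed:

    import torch

    bert.eval()  # disable dropout so the comparison is deterministic
    with torch.no_grad():
        # the same token id placed at two different positions
        ids = torch.tensor([[2023, 2023]])
        emb = bert.embeddings(input_ids=ids)
        # with zeroed position embeddings, both positions yield identical vectors
        print(torch.allclose(emb[0, 0], emb[0, 1]))  # prints: True
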
    If you plan to fine-tune the modified model, make sure the zeroed parameters do not get updated by freezing them:

    bert.embeddings.position_embeddings.requires_grad_(False)
    

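    If you build the optimizer yourself, you can also filter the frozen parameters out so they are never passed to it (a sketch; the learning rate is just a placeholder):

    import torch

    trainable = [p for p in bert.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=5e-5)  # placeholder learning rate
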
    Bypassing the position embeddings this way might work well when you train a model from scratch. When you work with a pre-trained model, removing these parameters might confuse the model quite a bit, so more fine-tuning data might be needed. In this case, there might be better strategies for replacing the position embeddings, e.g., using the average value for all positions, as sketched below.
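
    A minimal sketch of that averaging idea, assuming you start from the original pre-trained position embeddings (i.e., before zeroing them): every position is overwritten with the mean learned position embedding.

    import torch

    with torch.no_grad():
        pos = bert.embeddings.position_embeddings.weight   # shape (512, 768)
        mean_pos = pos.mean(dim=0, keepdim=True)           # average over all positions
        pos.copy_(mean_pos.expand_as(pos))                 # every position gets the same vector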