huggingface-transformers · bert-language-model · word-embedding

BERT without positional embeddings


I am trying to build a pipeline in HuggingFace that does not use the positional embeddings in BERT, in order to study their role for a particular use case. I have looked through the documentation and the code, but I have not been able to find a way to implement a model like that. Will I need to modify the BERT source code, or is there a configuration I can fiddle around with?


Solution

  • You can use a workaround and set the position embedding weights to zeros. When you inspect the embeddings part of BERT, you can see that the position embeddings are there as a separate PyTorch module:

    from transformers import AutoModel
    bert = AutoModel.from_pretrained("bert-base-cased")
    print(bert.embeddings)
    
    BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    

    You can assign the position embedding parameters whatever value you want, including zeros, which will effectively disable the position embeddings:

    import torch

    bert.embeddings.position_embeddings.weight.data = torch.zeros((512, 768))
    

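    As a quick sanity check (a minimal sketch; the token id below is arbitrary), you can confirm that the embedding layer's output no longer depends on the position once the weights are zeroed:

    import torch

    bert.eval()  # disable dropout so the comparison is deterministic
    with torch.no_grad():
        # the same token id placed at two different positions
        ids = torch.tensor([[2023, 2023]])
        emb = bert.embeddings(input_ids=ids)
        # with zeroed position embeddings, both positions yield identical vectors
        print(torch.allclose(emb[0, 0], emb[0, 1]))  # prints: True
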
    If you plan to fine-tune the modified model, make sure the zeroed parameters do not get updated by freezing them:

    bert.embeddings.position_embeddings.requires_grad_(False)
    

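    If you build the optimizer yourself, you can also filter the frozen parameters out so they are never passed to it (a sketch; the learning rate is just a placeholder):

    import torch

    trainable = [p for p in bert.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=5e-5)  # placeholder learning rate
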
    Bypassing the position embeddings this way might work well when you train a model from scratch. When you work with a pre-trained model, removing these parameters might confuse the model quite a bit, so more fine-tuning data might be needed. In this case, there might be better strategies for replacing the position embeddings, e.g., using the average value for all positions, as sketched below.
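
    A minimal sketch of that averaging idea, assuming you start from the original pre-trained position embeddings (i.e., before zeroing them): every position is overwritten with the mean learned position embedding.

    import torch

    with torch.no_grad():
        pos = bert.embeddings.position_embeddings.weight   # shape (512, 768)
        mean_pos = pos.mean(dim=0, keepdim=True)           # average over all positions
        pos.copy_(mean_pos.expand_as(pos))                 # every position gets the same vector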