Tags: deep-learning, pytorch, bert-language-model, transformer-model

What is the essence of a learnable positional embedding? Would re-applying it at every layer improve results?


I was recently reading the BERT source code from the Hugging Face project, and I noticed that, in terms of implementation, the so-called "learnable positional encoding" seems to boil down to a specific nn.Parameter.

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.positional_encoding = nn.Parameter(torch.zeros(max_len, d_model))  # one learnable vector per position

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.positional_encoding[: x.size(1)]

↑ Is this roughly how learnable positional encoding is done? I'm not sure whether it is really that simple or whether I'm misunderstanding it, so I'd like to ask someone with experience.
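For concreteness, here is a hypothetical usage of the sketch above on BERT-sized inputs (the vocabulary and hidden sizes are those of BERT-base; the variable names and random ids are just for illustration, and it reuses the imports and class from the snippet above):

emb = nn.Embedding(30522, 768)                        # word embeddings, BERT-base sizes
pos_enc = LearnedPositionalEncoding(max_len=512, d_model=768)
ids = torch.randint(0, 30522, (2, 128))               # a dummy batch of token ids
x = pos_enc(emb(ids))                                 # token + positional embeddings, shape (2, 128, 768)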

In addition, I noticed that in the classic BERT architecture the position is actually encoded only once, at the initial input. Does that mean the subsequent BERT layers lose the ability to capture positional information?

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(...)
      ...
    )
  )
  (pooler): BertPooler(...)
)

Would I get better results if the output of each layer were positionally encoded again before being fed to the next BERT layer?


Solution

  • What is the purpose of positional embeddings?

    In transformers (BERT included) the only interaction between different tokens happens via the self-attention layers. If you look closely at the mathematical operation these layers implement, you will notice that they are permutation equivariant: that is, the representations of
    "I do like coding"
    and
    "Do I like coding"
    are the same (up to the reordering of the tokens), because the words (=tokens) are identical in both sentences; only their order differs.
    As you can see, this "permutation equivariance" is not a desired property in many cases.
    To break this symmetry/equivariance one can simply "code" the actual position of each word/token in the sentence. For example:
    "I_1 do_2 like_3 coding_4"
    is no longer identical to
    "Do_1 I_2 like_3 coding_4"

    This is the purpose of positional encoding/embeddings -- to make self-attention layers sensitive to the order of the tokens. (A small runnable sketch illustrating this is included at the end of this answer.)

    Now to your questions:

    1. Learnable positional encoding is indeed implemented with a single nn.Parameter. The positional encoding is just a "code" added to each token, marking its position in the sequence. Therefore, all it requires is a tensor of the same shape as the embedded input sequence, with a different value per position.
    2. Is it enough to introduce positional encoding once in a transformer architecture? Yes! Since transformers stack multiple self-attention layers, it is enough to add positional embeddings once, at the beginning of processing. The positional information gets "fused" into the semantic representation learned for each token.
      A nice visualization of this effect in Vision Transformers (ViT) can be found in this work:
      Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel, "Deep ViT Features as Dense Visual Descriptors" (arXiv 2021).
      In sec. 3.1 and fig. 3 they show how the position information dominates the representation of tokens at early layers, but as you go deeper in a transformer, semantic information takes over.
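    As a concrete illustration of both answers, here is a minimal sketch using a single torch.nn.MultiheadAttention layer as a stand-in for a transformer block (the layer sizes, the random seed, and all variable names are illustrative, not taken from BERT). Without positional information, permuting the input tokens only permutes the output rows; adding a learned nn.Parameter per position, once, at the input, breaks that symmetry.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d_model, seq_len = 16, 4
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=1, batch_first=True)

    tokens = torch.randn(1, seq_len, d_model)   # "I do like coding"
    perm = torch.tensor([1, 0, 2, 3])           # "Do I like coding"
    permuted = tokens[:, perm, :]

    with torch.no_grad():
        out, _ = attn(tokens, tokens, tokens)
        out_perm, _ = attn(permuted, permuted, permuted)

    # Self-attention alone is permutation equivariant: the permuted input yields
    # the same per-token outputs, only in permuted order.
    print(torch.allclose(out[:, perm, :], out_perm, atol=1e-5))   # True

    # Add a learnable positional embedding (one vector per position) to the
    # inputs only, and the symmetry is broken.
    pos = nn.Parameter(torch.randn(1, seq_len, d_model))
    with torch.no_grad():
        out, _ = attn(tokens + pos, tokens + pos, tokens + pos)
        shifted = permuted + pos
        out_perm, _ = attn(shifted, shifted, shifted)

    print(torch.allclose(out[:, perm, :], out_perm, atol=1e-5))   # False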