Why is the positional encoding of size (1, num_patches, emb_dim)? Shouldn't it be (batch_size, num_patches, emb_dim) in general?
Even in the PyTorch GitHub code https://github.com/pytorch/vision/blob/main/torchvision/models/vision_transformer.py they define:
self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02)) # from BERT
Can anyone help me? What should I use as the positional encoding in my code:
self.pos_embedding = nn.Parameter(torch.empty(batch_size, seq_length, hidden_dim).normal_(std=0.02))
Is this correct?
You don't know the batch_size when initializing self.pos_embedding, so you should initialize this tensor as:
self.pos_embedding = nn.Parameter(
torch.empty(1, num_patches + 1, hidden_dim).normal_(std=0.02)
)
# (don't forget about the cls token)
PyTorch will take care of broadcasting the tensors in the forward pass:
x = x + self.pos_embedding
# (batch_size, num_patches + 1, embedding_dim) + (1, num_patches + 1, embedding_dim) is ok
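For reference, here is a quick shape check of that broadcasted addition; the dimensions (batch_size=8, num_patches=196, hidden_dim=768) are just example values, not anything from your model:

import torch

batch_size, num_patches, hidden_dim = 8, 196, 768
x = torch.randn(batch_size, num_patches + 1, hidden_dim)                 # patch tokens + cls token
pos_embedding = torch.empty(1, num_patches + 1, hidden_dim).normal_(std=0.02)
out = x + pos_embedding   # the size-1 batch dim is broadcast over the real batch
print(out.shape)          # torch.Size([8, 197, 768])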
Broadcasting won't work for the cls token, though, because it is concatenated to the sequence rather than added. You should expand that tensor to the batch size in forward:
cls_token = self.cls_token.expand(
batch_size, -1, -1
)
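Putting both parts together, here is a minimal sketch of the embedding step; the class name, the default sizes, and the assumption that x already contains patch embeddings are mine, not the torchvision implementation:

import torch
import torch.nn as nn

class ToyViTEmbedding(nn.Module):
    def __init__(self, num_patches=196, hidden_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_embedding = nn.Parameter(
            torch.empty(1, num_patches + 1, hidden_dim).normal_(std=0.02)
        )

    def forward(self, x):                                    # x: (batch_size, num_patches, hidden_dim)
        batch_size = x.shape[0]
        cls_token = self.cls_token.expand(batch_size, -1, -1)  # one cls token per sample
        x = torch.cat([cls_token, x], dim=1)                 # (batch_size, num_patches + 1, hidden_dim)
        x = x + self.pos_embedding                           # broadcast over the batch dimension
        return x

# Usage: the positional embedding stays (1, 197, 768) regardless of batch size.
tokens = torch.randn(8, 196, 768)
print(ToyViTEmbedding()(tokens).shape)                       # torch.Size([8, 197, 768])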