Why is the positional encoding of size (1, num_patches, emb_dim)? Shouldn't it be (batch_size, num_patches, emb_dim) in general?
Even in the PyTorch GitHub code https://github.com/pytorch/vision/blob/main/torchvision/models/vision_transformer.py they define:
self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02)) # from BERT
Can anyone help me? What should I use as the positional encoding in my code:
self.pos_embedding = nn.Parameter(torch.empty(batch_size, seq_length, hidden_dim).normal_(std=0.02))
Is this correct?
You don't know the batch_size when initializing self.pos_embedding, so you should initialize this tensor as:
self.pos_embedding = nn.Parameter(
torch.empty(1, num_patches + 1, hidden_dim).normal_(std=0.02)
)
# (don't forget about the cls token)
PyTorch will take care of broadcasting the tensors in the forward pass:
x = x + self.pos_embedding
# (batch_size, num_patches + 1, embedding_dim) + (1, num_patches + 1, embedding_dim) is ok
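For reference, here is a quick shape check of that broadcasted addition; the dimensions (batch_size=8, num_patches=196, hidden_dim=768) are just example values, not anything from your model:

import torch

batch_size, num_patches, hidden_dim = 8, 196, 768
x = torch.randn(batch_size, num_patches + 1, hidden_dim)                 # patch tokens + cls token
pos_embedding = torch.empty(1, num_patches + 1, hidden_dim).normal_(std=0.02)
out = x + pos_embedding   # the size-1 batch dim is broadcast over the real batch
print(out.shape)          # torch.Size([8, 197, 768])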
Broadcasting won't work for the cls token, though, because it is concatenated to the sequence rather than added. You should expand that tensor to the batch size in forward:
cls_token = self.cls_token.expand(
batch_size, -1, -1
)
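Putting both parts together, here is a minimal sketch of the embedding step; the class name, the default sizes, and the assumption that x already contains patch embeddings are mine, not the torchvision implementation:

import torch
import torch.nn as nn

class ToyViTEmbedding(nn.Module):
    def __init__(self, num_patches=196, hidden_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_embedding = nn.Parameter(
            torch.empty(1, num_patches + 1, hidden_dim).normal_(std=0.02)
        )

    def forward(self, x):                                    # x: (batch_size, num_patches, hidden_dim)
        batch_size = x.shape[0]
        cls_token = self.cls_token.expand(batch_size, -1, -1)  # one cls token per sample
        x = torch.cat([cls_token, x], dim=1)                 # (batch_size, num_patches + 1, hidden_dim)
        x = x + self.pos_embedding                           # broadcast over the batch dimension
        return x

# Usage: the positional embedding stays (1, 197, 768) regardless of batch size.
tokens = torch.randn(8, 196, 768)
print(ToyViTEmbedding()(tokens).shape)                       # torch.Size([8, 197, 768])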