In the original paper "Attention Is All You Need", the positional encoding is defined as:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
but in Transformer's model_utils.py, I found that the formula is different at line 53. In the paper, the sin and cos functions alternate depending on whether the dimension index is even or odd, while in the implementation each of them occupies a contiguous half of the dimensions. A rough sketch of the two layouts follows below.
You are right, but I don't think that makes any difference. The representation of each position given by the positional encoding is unique whether you concatenate the sin/cos halves or interleave them in the final vector.
As long as the encoding is unique and we always generate the encoding consistently, the positional information is preserved in the model.
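To illustrate with the sketch above (my own check, not code from the repo): the concatenated layout is just a fixed permutation of the interleaved columns, so the two carry exactly the same information, and every position still gets a distinct vector.

```python
pe_inter = positional_encoding_interleaved(100, 16)
pe_concat = positional_encoding_concat(100, 16)

# The concatenated layout is a fixed reordering of the interleaved columns.
perm = np.concatenate([np.arange(0, 16, 2), np.arange(1, 16, 2)])
assert np.allclose(pe_inter[:, perm], pe_concat)

# Each position still maps to a distinct encoding vector in either layout.
assert len(np.unique(np.round(pe_concat, 8), axis=0)) == 100
```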