Tags: vector, nlp, embedding, transformer-model, bert-language-model

Why can BERT's three embeddings be added?


I already know the meaning of Token Embedding, Segment Embedding, and Position Embedding. But why can these three vectors be added together? The size and direction of the vectors will change after the addition, and the semantics of the word will also change. (It's the same question for the Transformer model, which has two embeddings named Input Embedding and Position Embedding.)


Solution

  • Firstly, these vectors are added element-wise -> the dimensionality of the embeddings stays the same.

    Secondly, position plays a significant role in the meaning of a token, so it should somehow be part of the embedding. Note that the token embedding does not necessarily hold semantic information as we know it from word2vec; all of these embeddings (token, segment, and position) are learned together in pre-training, so that together they best accomplish the tasks. In pre-training they are already added together, so they are trained specifically for this case. The direction of the vectors does change with this addition, but the new direction gives important information to the model, packed into just one vector. (See the sketch after this list for how the sum is computed.)

    Note: Each vector is huge (768 dimensions in the base model)
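
    Below is a minimal sketch of that element-wise sum, assuming PyTorch and the usual BERT-base sizes (hidden size 768, vocabulary of 30522, 512 positions, 2 segment types). The variable names and the toy input ids are illustrative, not taken from any particular library's internals.

    ```python
    import torch
    import torch.nn as nn

    # Assumed BERT-base-like sizes for illustration.
    vocab_size, max_positions, num_segments, hidden_size = 30522, 512, 2, 768

    token_emb = nn.Embedding(vocab_size, hidden_size)
    position_emb = nn.Embedding(max_positions, hidden_size)
    segment_emb = nn.Embedding(num_segments, hidden_size)

    # Toy input: one sequence of 6 token ids, all belonging to segment 0.
    input_ids = torch.tensor([[101, 7592, 1010, 2088, 999, 102]])
    position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)
    segment_ids = torch.zeros_like(input_ids)

    # Each lookup produces a (batch, seq_len, hidden_size) tensor, so the
    # element-wise sum keeps the shape unchanged.
    embeddings = token_emb(input_ids) + position_emb(position_ids) + segment_emb(segment_ids)
    print(embeddings.shape)  # torch.Size([1, 6, 768])
    ```

    Because all three lookup tables are trained jointly with this sum in place, the model learns token, segment, and position vectors that are useful precisely when combined this way.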