I study visual attention models but have recently been reading up on BERT and other language attention models to fill a serious gap in my knowledge.
I am a bit confused by what I seem to be seeing in these model architectures. Given a sentence like "the cat chased the dog", I would have expected cross-information streams between the embeddings of each word. For example, I would have expected to see a point in the model where the embedding for "cat" is combined with the embedding for "dog" in order to create the attention weights.
Instead, what I seem to be seeing (correct me if I am wrong) is that the embedding of a word like "cat" is initially set up to include information about the words around it, so that each word's embedding already encodes all of the surrounding words. Each of these embeddings is then passed through the model in parallel. This seems weird and redundant to me. Why would they set up the model this way?
If we were to block out "cat", as in "the ... chased the dog", would we then, during inference, only need to send the "..." embedding through the model?
The embeddings don't contain any information about the other embeddings around them. BERT and other models like OpenAI GPT/GPT-2 don't have context-dependent inputs.
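To make that concrete, here is a minimal sketch in plain PyTorch with a made-up toy vocabulary (not BERT's real tokenizer or weights): the input embedding is just a table lookup, so "cat" gets the same vector no matter which sentence it appears in.

```python
# Minimal sketch (plain PyTorch, toy vocabulary -- not BERT's real tokenizer
# or weights): the input embedding is a plain table lookup.
import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "chased": 2, "dog": 3, "sat": 4}
word_emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

sent_a = torch.tensor([vocab[w] for w in "the cat chased the dog".split()])
sent_b = torch.tensor([vocab[w] for w in "the cat sat".split()])

emb_a = word_emb(sent_a)  # shape (5, 8)
emb_b = word_emb(sent_b)  # shape (3, 8)

# "cat" is position 1 in both sentences; its input embedding is identical.
print(torch.equal(emb_a[1], emb_b[1]))  # True: no context baked in yet
```

(In BERT a position embedding and a segment embedding are added on top of this lookup, but those depend only on the token's position in the sequence, not on which other words are present.)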
The context-related part comes later. What attention-based models do is use these input embeddings to create other vectors, which then interact with each other through various matrix multiplications, summations, and normalizations. This is what lets the model capture context, which in turn lets it do interesting things such as language generation.
When you say 'I would have expected to see a point in the model where the embedding for "cat" is combined with the embedding for "dog" in order to create the attention weights', you are right. That does happen, just not at the embedding level. We create more vectors by multiplying the embeddings with learned matrices, and it is those vectors that interact with each other.
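As a rough illustration of that step, here is a single self-attention head sketched in PyTorch with random, untrained weights and toy dimensions (illustrative only, not BERT's actual parameters). The learned matrices project each context-free embedding into query, key and value vectors; the softmax-normalized dot products between those vectors are where "cat" finally interacts with "dog".

```python
# Rough sketch of a single self-attention head (random untrained weights,
# toy dimensions -- illustrative only, not BERT's actual parameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab = {"the": 0, "cat": 1, "chased": 2, "dog": 3}
tokens = "the cat chased the dog".split()
ids = torch.tensor([vocab[w] for w in tokens])

d_model, d_head = 8, 8
emb = nn.Embedding(len(vocab), d_model)(ids)  # (5, d_model) context-free embeddings

W_q = nn.Linear(d_model, d_head, bias=False)  # learned projection matrices
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)

Q, K, V = W_q(emb), W_k(emb), W_v(emb)        # (5, d_head) each

scores = Q @ K.T / d_head ** 0.5              # every token scores every other token
weights = F.softmax(scores, dim=-1)           # normalize so each row sums to 1
output = weights @ V                          # context-mixed representations, (5, d_head)

# This is where "cat" (position 1) and "dog" (position 4) interact:
print(f'attention of "cat" on "dog": {weights[1, 4].item():.3f}')
```

In BERT this is done with many heads in parallel and repeated layer after layer, so the representations become more and more context-dependent as you go deeper into the model.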