Tags: nlp, transformer-model, attention-model

Why use multi-headed attention in Transformers?


I am trying to understand why transformers use multiple attention heads. I found the following quote:

Instead of using a single attention function where the attention can be dominated by the actual word itself, transformers use multiple attention heads.

What is meant by "the attention being dominated by the word itself" and how does the use of multiple heads address that?


Solution

  • Multi-headed attention was introduced based on the observation that different words relate to each other in different ways. For a given word, the other words in the sentence can moderate or negate its meaning, but they can also express relations such as inheritance (is a kind of), possession (belongs to), and so on.

    I found this online lecture very helpful; it uses the following example:

    "The restaurant was not too terrible."

    Note that the meaning of the word 'terrible' is modified by the two words 'too' and 'not' ('too' moderates it, 'not' inverts it), and 'terrible' also relates to 'restaurant', since it expresses a property of it. A single attention pattern cannot capture all of these relations at once, so each head can specialise in a different one; a short code sketch of multi-head self-attention follows below.
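
    To make this concrete, here is a minimal NumPy sketch of multi-head self-attention. It is not the lecture's or the original paper's exact implementation: the toy dimensions, the random weight initialisation, and the multi_head_self_attention helper are illustrative assumptions. The point it shows is that each head has its own query/key/value projections, so each head can learn a different attention pattern over the same sentence.

        # Minimal sketch of multi-head self-attention in NumPy.
        # Dimensions and weight initialisation are illustrative, not a
        # reproduction of any particular library's implementation.
        import numpy as np

        def softmax(x, axis=-1):
            x = x - x.max(axis=axis, keepdims=True)
            e = np.exp(x)
            return e / e.sum(axis=axis, keepdims=True)

        def multi_head_self_attention(X, num_heads, rng):
            """X: (seq_len, d_model). Returns (seq_len, d_model)."""
            seq_len, d_model = X.shape
            d_head = d_model // num_heads
            outputs = []
            for _ in range(num_heads):
                # Each head gets its own projections, so it can learn to
                # focus on a different relation (negation, moderation,
                # "is a kind of", possession, ...).
                W_q = rng.normal(scale=d_model**-0.5, size=(d_model, d_head))
                W_k = rng.normal(scale=d_model**-0.5, size=(d_model, d_head))
                W_v = rng.normal(scale=d_model**-0.5, size=(d_model, d_head))
                Q, K, V = X @ W_q, X @ W_k, X @ W_v
                # Scaled dot-product attention: every word attends to every
                # word, including itself, but the learned projections can
                # shift weight away from a word's own position.
                weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
                outputs.append(weights @ V)
            # Concatenate the heads and mix them back down to d_model.
            W_o = rng.normal(scale=d_model**-0.5, size=(num_heads * d_head, d_model))
            return np.concatenate(outputs, axis=-1) @ W_o

        rng = np.random.default_rng(0)
        tokens = ["The", "restaurant", "was", "not", "too", "terrible"]
        X = rng.normal(size=(len(tokens), 8))   # toy embeddings, d_model = 8
        out = multi_head_self_attention(X, num_heads=2, rng=rng)
        print(out.shape)                        # (6, 8)

    Running this prints (6, 8): each of the six tokens gets a representation assembled from two differently-projected attention patterns, which is the mechanism that would let one head attend to the negation 'not' while another attends to 'restaurant'.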