I tried to find the source code of multihead attention but could not find any implementation details. I wonder if this module only contains the attention part rather than the whole transformer block (i.e. it does not contain the normalisation layers, residual connections, and the additional feed-forward network)?
Yes — according to the source code, MultiheadAttention unsurprisingly implements only the attention function; the layer norms, residual connections, and feed-forward network are not part of it.
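
If you need the rest of the block, you have to compose it yourself (or use `nn.TransformerEncoderLayer`, which bundles the attention sub-layer, feed-forward network, residual connections, and layer norms for you). Below is a minimal sketch of a post-norm encoder block built around `nn.MultiheadAttention`; the hyperparameter values are just illustrative defaults, not anything prescribed by PyTorch:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal post-norm transformer encoder block (sketch).

    nn.MultiheadAttention supplies only the attention sub-layer; the
    residual connections, layer norms, and feed-forward network are
    added here by hand.
    """

    def __init__(self, embed_dim=512, num_heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sub-layer + residual connection + layer norm
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward sub-layer + residual connection + layer norm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x


x = torch.randn(2, 10, 512)        # (batch, seq_len, embed_dim)
print(EncoderBlock()(x).shape)     # torch.Size([2, 10, 512])
```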