I tried to find the source code of multihead attention but could not find any implementation details. I wonder if this module only contains the attention part rather than the whole transformer block (i.e. it does not contain the normalisation layers, residual connections, and the additional feed-forward network)?
Yes — according to the source code, MultiheadAttention unsurprisingly implements only the attention function; the layer norms, residual connections, and feed-forward network are not part of it.
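
If you need the rest of the block, you have to compose it yourself (or use `nn.TransformerEncoderLayer`, which bundles the attention sub-layer, feed-forward network, residual connections, and layer norms for you). Below is a minimal sketch of a post-norm encoder block built around `nn.MultiheadAttention`; the hyperparameter values are just illustrative defaults, not anything prescribed by PyTorch:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal post-norm transformer encoder block (sketch).

    nn.MultiheadAttention supplies only the attention sub-layer; the
    residual connections, layer norms, and feed-forward network are
    added here by hand.
    """

    def __init__(self, embed_dim=512, num_heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention sub-layer + residual connection + layer norm
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward sub-layer + residual connection + layer norm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x


x = torch.randn(2, 10, 512)        # (batch, seq_len, embed_dim)
print(EncoderBlock()(x).shape)     # torch.Size([2, 10, 512])
```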