From the literature I read,
BERT Base has 12 encoder layers and 12 attention heads, while BERT Large has 24 encoder layers and 16 attention heads.
Why does BERT Large have 16 attention heads?
The number of attention heads is independent of the number of (encoder) layers.
However, there is an inherent tie between the number of heads and the hidden size of each model (768 for bert-base, 1024 for bert-large), which is explained in the original Transformer paper ("Attention Is All You Need").
Essentially, the choice made by the authors is that the size of each self-attention head (d_k) equals the hidden dimension (d_hidden) divided by the number of heads (h), or formally:

d_k = d_hidden / h
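
As a rough illustration of what this split looks like in practice, here is a minimal PyTorch-style sketch using the bert-base numbers (the view/transpose pattern is one common way to implement the head split, not BERT's exact code):

```python
import torch

d_hidden, h = 768, 12            # bert-base hidden size and head count
d_k = d_hidden // h              # 768 / 12 = 64, size of each head

x = torch.randn(2, 8, d_hidden)  # (batch, seq_len, d_hidden)

# Split the hidden dimension into h heads of size d_k each.
heads = x.view(2, 8, h, d_k).transpose(1, 2)
print(heads.shape)               # torch.Size([2, 12, 8, 64])
```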
Since the standard choice seems to be d_k = 64, we can infer the number of heads from our parameters:

h = d_hidden / d_k = 1024 / 64 = 16

which is exactly the value you are seeing for bert-large.
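
If you want to confirm this from the model configurations themselves, a quick check with the Hugging Face transformers library (assuming it is installed and the configs can be fetched from the Hub) would look roughly like this:

```python
from transformers import AutoConfig

for name in ("bert-base-uncased", "bert-large-uncased"):
    cfg = AutoConfig.from_pretrained(name)
    d_k = cfg.hidden_size // cfg.num_attention_heads
    print(f"{name}: layers={cfg.num_hidden_layers}, "
          f"hidden={cfg.hidden_size}, heads={cfg.num_attention_heads}, d_k={d_k}")

# Expected output (both models keep d_k = 64):
# bert-base-uncased: layers=12, hidden=768, heads=12, d_k=64
# bert-large-uncased: layers=24, hidden=1024, heads=16, d_k=64
```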