Tags: bert-language-model, transformer-model

BERT Heads Count


From the literature I read,

Bert Base has 12 encoder layers and 12 attention heads. Bert Large has 24 encoder layers and 16 attention heads.

Why does Bert Large have 16 attention heads?


Solution

  • The number of attention heads is independent of the number of (encoder) layers. There is, however, an inherent tie between the number of heads and the hidden size of each model (768 for bert-base, 1024 for bert-large), which is explained in the original Transformer paper. Essentially, the authors chose the per-head dimension of the self-attention block (d_k) to be the hidden dimension (d_hidden) divided by the number of heads (h), or formally

    d_k = d_hidden / h
    

    Since the standard choice seems to be d_k = 64, we can infer the number of heads from the hidden size:

    h = d_hidden / d_k = 1024 / 64 = 16
    

    which is exactly the number of heads in bert-large. The same relation gives 768 / 64 = 12 heads for bert-base.
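
    To make the relation concrete, here is a minimal sketch, assuming the Hugging Face transformers package is installed and the configs for bert-base-uncased and bert-large-uncased can be fetched from the Hub. It reads each published config and checks that hidden_size / num_attention_heads comes out to 64 for both models:

    # A minimal sketch, assuming the `transformers` package is installed and
    # the model configs can be downloaded from the Hugging Face Hub.
    from transformers import AutoConfig

    for name in ("bert-base-uncased", "bert-large-uncased"):
        cfg = AutoConfig.from_pretrained(name)
        d_hidden = cfg.hidden_size           # 768 for base, 1024 for large
        h = cfg.num_attention_heads          # 12 for base, 16 for large
        d_k = d_hidden // h                  # per-head dimension
        print(f"{name}: d_hidden={d_hidden}, h={h}, d_k={d_k}")
        assert d_k == 64                     # the standard per-head size

    Both models report d_k = 64, so the larger hidden size of bert-large (1024) is what forces the head count up to 16.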