Tags: neural-network, nlp, bert-language-model

How is the number of parameters calculated in the BERT model?


The paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin & Co. calculated for the base model size 110M parameters (i.e. L=12, H=768, A=12) where L = number of layers, H = hidden size and A = number of self-attention operations. As far as I know parameters in a neural network are usually the count of "weights and biases" between the layers. So how is this calculated based on the given information? 12768768*12?


Solution

  • Transformer Encoder-Decoder Architecture: The BERT model contains only the encoder block of the transformer architecture. Let's look at the individual elements of an encoder block for BERT to visualize the number of weight matrices as well as the bias vectors. The given configuration L = 12 means there are 12 layers of self-attention, H = 768 means that the embedding dimension of individual tokens is 768, and A = 12 means there are 12 attention heads in each self-attention layer. The encoder block performs the following sequence of operations (a short sketch after this list prints the corresponding parameter shapes):

    1. The input is the sequence of tokens as a matrix of S * d dimensions, where S is the sequence length and d is the embedding dimension. The input representation of each token is the sum of its token embedding, token-type embedding and position embedding, giving a d-dimensional vector per token. In the BERT model, the first set of parameters is the vocabulary embeddings. BERT uses WordPiece [2] embeddings with a vocabulary of 30522 tokens, each of 768 dimensions.

    2. Embedding layer normalization. One weight vector and one bias vector.

    3. Multi-head self-attention. There are A = 12 heads, and each head has three matrices: a query matrix, a key matrix and a value matrix. The first dimension of these matrices is the embedding dimension and the second dimension is the embedding dimension divided by the number of attention heads (768 / 12 = 64); each projection also has a bias vector of that smaller size. Apart from this, there is one more matrix (with a bias) that transforms the concatenated values produced by the attention heads into the final token representation.

    4. Residual connection and layer normalization. One weight vector and one bias vector.

    5. The position-wise feedforward network has one hidden layer, which corresponds to two weight matrices and two bias vectors. The paper mentions that the number of units in the hidden layer is four times the embedding dimension (4 * 768 = 3072).

    6. Residual connection and layer normalization. One weight vector and one bias vector.
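    To see these pieces concretely, the short sketch below prints the parameter names and shapes of the embedding block and of one encoder layer. It assumes the Hugging Face `transformers` package and the `bert-base-uncased` checkpoint; attribute names such as `model.embeddings` and `model.encoder.layer` refer to that implementation.

    ```python
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")

    # Embedding block: word, position and token-type embeddings plus LayerNorm
    print("Embedding parameters:")
    for name, p in model.embeddings.named_parameters():
        print(f"  {name:45s} {tuple(p.shape)}")

    # One of the 12 identical encoder layers: Q/K/V projections, output
    # projection, two LayerNorms and the position-wise feedforward network
    print("Encoder layer 0 parameters:")
    for name, p in model.encoder.layer[0].named_parameters():
        print(f"  {name:45s} {tuple(p.shape)}")
    ```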

    Let's calculate the actual number of parameters by associating the right dimensions to the weight matrices and bias vectors for the BERT base model.

    Embedding Matrices:

    • Word Embedding Matrix size [Vocabulary size, embedding dimension] = [30522, 768] = 23440896
    • Position embedding matrix size, [Maximum sequence length, embedding dimension] = [512, 768] = 393216
    • Token Type Embedding matrix size [2, 768] = 1536
    • Embedding Layer Normalization, weight and Bias [768] + [768] = 1536
    • Total Embedding parameters = 23837184 ≈ 24M
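    The same arithmetic as a small, framework-free Python check (the numbers are just the shapes listed above):

    ```python
    # Embedding parameters of BERT base, computed from the shapes listed above.
    vocab_size, max_position, type_vocab, hidden = 30522, 512, 2, 768

    word_embeddings = vocab_size * hidden        # 23,440,896
    position_embeddings = max_position * hidden  # 393,216
    token_type_embeddings = type_vocab * hidden  # 1,536
    embedding_layernorm = hidden + hidden        # weight + bias = 1,536

    total_embedding = (word_embeddings + position_embeddings
                       + token_type_embeddings + embedding_layernorm)
    print(total_embedding)                       # 23,837,184 ≈ 24M
    ```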

    Attention Head:

    • Query Weight Matrix size [768, 64] = 49152 and Bias [64] = 64

    • Key Weight Matrix size [768, 64] = 49152 and Bias [64] = 64

    • Value Weight Matrix size [768, 64] = 49152 and Bias [64] = 64

    • Total parameters for one layer of attention with 12 heads = 12 * (3 * (49152 + 64)) = 1771776

    • Dense weight for projection after concatenation of heads [768, 768] = 589824 and Bias [768] = 768, (589824+768 = 590592)

    • Layer Normalization weight and Bias [768], [768] = 1536

    • Position-wise feedforward network weight matrices and biases [3072, 768] = 2359296, [3072] = 3072 and [768, 3072] = 2359296, [768] = 768, (2359296 + 3072 + 2359296 + 768 = 4722432)

    • Layer Normalization weight and Bias [768], [768] = 1536

    • Total parameters for one complete attention layer (1771776 + 590592 + 1536 + 4722432 + 1536 = 7087872 ≈ 7M)

    • Total parameters for 12 layers of attention (12 * 7087872 = 85054464 ≈ 85M)
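    The per-layer count can be reproduced with the same kind of back-of-the-envelope script (plain arithmetic over the shapes listed above):

    ```python
    # Parameters of one encoder layer of BERT base, from the shapes listed above.
    hidden, heads, ffn_units = 768, 12, 3072
    head_dim = hidden // heads                           # 64

    qkv = heads * 3 * (hidden * head_dim + head_dim)     # 1,771,776
    attention_output = hidden * hidden + hidden          # 590,592
    layernorm = hidden + hidden                          # 1,536 (each occurrence)
    feed_forward = ((hidden * ffn_units + ffn_units)
                    + (ffn_units * hidden + hidden))     # 4,722,432

    per_layer = qkv + attention_output + layernorm + feed_forward + layernorm
    print(per_layer)       # 7,087,872 ≈ 7M
    print(12 * per_layer)  # 85,054,464 ≈ 85M
    ```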

    Output layer of the BERT encoder (the pooler):

    • Dense Weight Matrix and Bias [768, 768] = 589824, [768] = 768, (589824 + 768 = 590592)

    Total Parameters in BERT Base = 23837184 + 85054464 + 590592 = 109482240 ≈ 110M
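The grand total can be sanity-checked against an actual checkpoint. A minimal sketch, assuming the Hugging Face `transformers` package and the `bert-base-uncased` weights are available (the count includes the pooler, i.e. the dense output layer above):

```python
from transformers import BertModel

# Count every trainable tensor in the pretrained model:
# embeddings + 12 encoder layers + pooler.
model = BertModel.from_pretrained("bert-base-uncased")
total = sum(p.numel() for p in model.parameters())
print(total)  # 109,482,240 ≈ 110M
```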