
LayerNorm in PyTorch


Consider the following example:

    import torch

    batch, sentence_length, embedding_dim = 2, 3, 4
    embedding = torch.randn(batch, sentence_length, embedding_dim)
    print(embedding)
    
    # Output:
    tensor([[[-2.1918,  1.2574, -0.3838,  1.3870],
             [-0.4043,  1.2972, -1.7326,  0.4047],
             [ 0.4560,  0.6482,  1.0858,  2.2086]],
    
            [[-1.4964,  0.3722, -0.7766,  0.3062],
             [ 0.9812,  0.1709, -0.9177, -1.2558],
             [-1.1560, -0.0367,  0.5496, -1.1142]]])

Applying LayerNorm, which normalizes across the embedding dimension, I get:

    layer_norm = torch.nn.LayerNorm(embedding_dim)
    layer_norm(embedding)

    # Output:
    tensor([[[-1.5194,  0.8530, -0.2758,  0.9422],
             [-0.2653,  1.2620, -1.4576,  0.4609],
             [-0.9470, -0.6641, -0.0204,  1.6315]],

            [[-1.4058,  0.9872, -0.4840,  0.9026],
             [ 1.3933,  0.4803, -0.7463, -1.1273],
             [-0.9869,  0.5545,  1.3619, -0.9294]]],
           grad_fn=<NativeLayerNormBackward0>)

Now, when I normalize the first vector of the above embedding tensor with a naive Python implementation, I get:

    import math
    import statistics

    a = [-2.1918,  1.2574, -0.3838,  1.3870]
    mean_a = statistics.mean(a)
    var_a = statistics.stdev(a)
    eps = 1e-5
    d = [ ((i-mean_a)/math.sqrt(var_a + eps)) for i in a]
    print(d)
    
    # Output:
    [-1.7048934056508998, 0.9571791768620398, -0.3094894774404756, 1.0572037062293356]

The normalized values are not the same as what I get from PyTorch's LayerNorm. Is there something wrong with the way I calculated it?


Solution

  • What you want is the variance, not the standard deviation (the standard deviation is the square root of the variance, and your calculation of d takes the square root again, since var_a already holds the standard deviation). Also, LayerNorm uses the biased variance (statistics.pvariance, i.e. dividing by n rather than n - 1). To reproduce the expected results with the statistics module:

    import math
    import statistics

    a = [-2.1918,  1.2574, -0.3838,  1.3870]
    mean_a = statistics.mean(a)
    var_a = statistics.pvariance(a)
    eps = 1e-5
    d = [ ((i-mean_a)/math.sqrt(var_a + eps)) for i in a]
    print(d)
    [-1.519391435327454, 0.8530327107709863, -0.2758152854532861, 0.942174010009754]
    
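    The same correction can be checked for the whole tensor with plain tensor ops, reusing the embedding and layer_norm from the question. This is only a minimal sketch (the name manual_ln is illustrative, and it assumes LayerNorm's default eps=1e-5 with the affine weight and bias still at their initial values of 1 and 0):

    import torch

    eps = 1e-5
    # Normalize over the last (embedding) dimension using the biased variance,
    # which is what nn.LayerNorm does internally.
    mean = embedding.mean(dim=-1, keepdim=True)
    var = embedding.var(dim=-1, unbiased=False, keepdim=True)
    manual_ln = (embedding - mean) / torch.sqrt(var + eps)

    print(torch.allclose(manual_ln, layer_norm(embedding), atol=1e-6))
    # True
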

    Another way to verify correct results is:

    [[torch.mean(i).item(), torch.var(i, unbiased=False).item()] for i in layer_norm(embedding)]
    
    [[1.9868215517249155e-08, 0.9999885559082031],
     [-1.9868215517249155e-08, 0.9999839663505554]]
    

    This shows that the mean and variance of the normalized embeddings are (very close to) 0 and 1, as expected.

    Relevant doc: "The standard-deviation is calculated via the biased estimator, equivalent to torch.var(input, unbiased=False)."
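
    To make the biased/unbiased distinction concrete, here is a small sketch (not part of the original answer) comparing the two estimators on the first embedding vector; the biased one divides by n, the unbiased one by n - 1:

    import statistics
    import torch

    a = [-2.1918,  1.2574, -0.3838,  1.3870]
    t = torch.tensor(a)

    # Biased estimator (divide by n) -- what LayerNorm uses.
    print(statistics.pvariance(a))               # ~2.1137
    print(torch.var(t, unbiased=False).item())   # ~2.1137

    # Unbiased estimator (divide by n - 1) -- statistics.variance; stdev is its square root.
    print(statistics.variance(a))                # ~2.8183
    print(torch.var(t, unbiased=True).item())    # ~2.8183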