python, python-3.x, keras, deep-learning, word-embedding

Why does using Dropout on the output of the embedding layer change the array values?


Observing the outputs of the embedding layer with and without dropout shows that some values in the arrays are replaced with 0. But why are the other values of the array changed as well?

Following is my model:

from keras.models import Model
from keras.layers import (Input, Embedding, Dropout, Bidirectional, LSTM,
                          TimeDistributed, Dense)

# n_words (vocabulary size), LSTM_N (LSTM units) and n_tags (number of output tags) are defined elsewhere
input = Input(shape=(23,))
model = Embedding(input_dim=n_words, output_dim=23, input_length=23)(input)
model = Dropout(0.2)(model)
model = Bidirectional(LSTM(units=LSTM_N, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(model)  # softmax output layer
model = Model(input, out)

Building model2 from the trained model, with the input layer as input and the output of Dropout(0.2) as output:

from keras import backend as K

# model2 maps (input batch, learning_phase) to the output of the Dropout layer (model.layers[2])
model2 = K.function([model.layers[0].input, K.learning_phase()],
                    [model.layers[2].output])
dropout = model2([X_train[0:1], 1])[0]   # learning_phase = 1: training mode, dropout active
nodrop = model2([X_train[0:1], 0])[0]    # learning_phase = 0: inference mode, dropout disabled

Printing the first row of both the dropout and no-dropout outputs:

dropout[0][0]

Output:

array([ 0.        , -0.        , -0.        , -0.04656423, -0.        ,
        0.28391626,  0.12213208, -0.01187495, -0.02078421, -0.        ,
        0.10585815, -0.        ,  0.27178472, -0.21080771,  0.        ,
       -0.09336889,  0.07441022,  0.02960865, -0.2755439 , -0.11252255,
       -0.04330419, -0.        ,  0.04974075], dtype=float32)   


nodrop[0][0]

Output:

array([ 0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845,
        0.22713302,  0.09770566, -0.00949996, -0.01662737, -0.05788678,
        0.08468652, -0.22405024,  0.21742778, -0.16864617,  0.08558936,
       -0.07469511,  0.05952817,  0.02368692, -0.22043513, -0.09001804,
       -0.03464335, -0.05152775,  0.0397926 ], dtype=float32)

Some values are replaced with 0, agreed, but why have the other values changed? Since the embedding outputs have a meaning and are unique for each word, if they are altered by applying dropout, is it then correct to apply dropout after the embedding layer?

Note: I have used learning_phase as 0 for testing (no dropout) and 1 for training (dropout), respectively.


Solution

  • This is how dropout regularization works: after applying dropout, the surviving values are divided by the keep probability (in this case 0.8).

    When you use dropout, the layer receives the probability of setting a neuron to zero as input, e.g. 0.2, which means any given neuron has a 0.8 chance of being kept. So the remaining values are multiplied by 1/(1 - 0.2) = 1.25.

    This is called the "inverted dropout" technique, and it is done to ensure that the expected value of the activations stays the same. Otherwise, predictions would be wrong during inference, when dropout is not used. (A minimal sketch of this scaling follows the numeric check below.)

    You'll notice that your dropout rate is 0.2, so after applying dropout every surviving value has been divided by 0.8 (i.e. multiplied by 1.25); equivalently, the no-dropout values are 0.8 times the corresponding dropout values.

    Look what happens if I divide your second output by the first:

    # a = dropout[0][0] (output with dropout), b = nodrop[0][0] (output without dropout)
    import numpy as np
    a = np.array([ 0.        , -0.        , -0.        , -0.04656423, -0.        ,
            0.28391626,  0.12213208, -0.01187495, -0.02078421, -0.        ,
            0.10585815, -0.        ,  0.27178472, -0.21080771,  0.        ,
           -0.09336889,  0.07441022,  0.02960865, -0.2755439 , -0.11252255,
           -0.04330419, -0.        ,  0.04974075])
    
    b = np.array([ 0.09657606, -0.06267098, -0.00049554, -0.03725138, -0.11286845,
            0.22713302,  0.09770566, -0.00949996, -0.01662737, -0.05788678,
            0.08468652, -0.22405024,  0.21742778, -0.16864617,  0.08558936,
           -0.07469511,  0.05952817,  0.02368692, -0.22043513, -0.09001804,
           -0.03464335, -0.05152775,  0.0397926 ])
    
    print(b/a)   # zeroed entries give inf (division by zero); all other ratios are 0.8
    
    [       inf        inf        inf 0.79999991        inf 0.80000004
     0.79999997 0.8        0.8000001         inf 0.8               inf
     0.80000001 0.80000001        inf 0.79999998 0.79999992 0.8
     0.80000004 0.8        0.79999995        inf 0.8       ]
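
    To make the rescaling concrete, here is a minimal NumPy sketch of inverted dropout. It is only an illustration of the idea, not Keras's actual implementation: a random mask zeroes each element with probability rate, and the survivors are divided by (1 - rate) so that the expected value is unchanged.

    import numpy as np

    def inverted_dropout(x, rate=0.2, seed=None):
        # Zero each element with probability rate, then rescale the
        # survivors by 1 / (1 - rate) so the expected value is preserved.
        rng = np.random.default_rng(seed)
        keep = rng.random(x.shape) >= rate            # True with probability 1 - rate
        return np.where(keep, x / (1.0 - rate), 0.0)

    x = np.array([0.09657606, -0.06267098, 0.22713302, 0.21742778])
    print(inverted_dropout(x, rate=0.2, seed=0))      # survivors are divided by 0.8, the rest are 0
    print(0.21742778 / 0.8)                           # 0.271784725, matching 0.27178472 in your output

    Averaged over many random masks, the rescaled activations have the same expected value as the originals, which is exactly why the non-zero entries of your dropout output are 1.25 times the corresponding no-dropout values.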