
Why do non-zero values change in Keras Dropout?


Suppose I have a tensor:

import tensorflow as tf

# tf.range already returns a tensor, so the tf.constant wrapper is unnecessary
x = tf.reshape(tf.range(1, 21, dtype=tf.float32), (5, 4))

<tf.Tensor: id=1080557, shape=(5, 4), dtype=float32, numpy=
array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.],
       [13., 14., 15., 16.],
       [17., 18., 19., 20.]], dtype=float32)>

And I apply dropout to it:

dropout = tf.keras.layers.Dropout(0.1)
# training=True forces dropout to be applied even outside of model.fit()
dropout(x, training=True)

<tf.Tensor: id=1080704, shape=(5, 4), dtype=float32, numpy=
array([[ 1.1111112,  2.2222223,  3.3333335,  0.       ],
       [ 5.555556 ,  6.666667 ,  7.777778 ,  8.888889 ],
       [10.       , 11.111112 , 12.222223 ,  0.       ],
       [14.444445 , 15.555556 , 16.666668 , 17.777779 ],
       [18.88889  ,  0.       , 21.111113 , 22.222223 ]], dtype=float32)>

Each time I run it, between 1 and 3 values are zeroed, which is not exactly rate=0.1. What range of rates does it actually apply, and why did the non-zeroed values change?

To visualize Celius Stingher's answer:

import numpy as np

l = 10000
f = np.zeros((5, 4))
for _ in range(l):
    f += dropout(x, training=True)
f = f / l  # long-run average over 10,000 dropout applications
f

<tf.Tensor: id=1234967, shape=(5, 4), dtype=float32, numpy=
array([[ 1.0006623,  1.999991 ,  2.988533 ,  4.017763 ],
       [ 5.000613 ,  6.0477467,  7.0076656,  8.0248575],
       [ 9.048    , 10.06455  , 10.980609 , 12.010143 ],
       [12.918334 , 14.100925 , 15.039784 , 16.014153 ],
       [17.0579   , 18.112    , 19.064175 , 20.024672 ]], dtype=float32)>

Solution

  • Dropout sets every neuron to 0 with a probability equal to the rate you pass. You can think of it as a binomial distribution[*] with p = 0.1 and n = 20: the expected number of zeros is 2 and the standard deviation is sqrt(20 * 0.1 * 0.9) ≈ 1.34, which explains why most of the time you'll see between 1 and 3 values being forced to 0. That is also why you can set a random seed within the dropout layer to ensure reproducibility (see the sketch below).

    [*] In this paper you can find further detail: the authors assume each r(j) follows a Bernoulli distribution (and a sum of independent Bernoulli trials follows a binomial distribution).
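    To see those numbers concretely, here is a minimal sketch (assuming the same 5x4 tensor and rate = 0.1) that counts the zeroed entries over many runs; the count follows a Binomial(n = 20, p = 0.1):

    import numpy as np
    import tensorflow as tf

    x = tf.reshape(tf.range(1, 21, dtype=tf.float32), (5, 4))
    dropout = tf.keras.layers.Dropout(0.1)

    # Count how many of the 20 entries are zeroed on each run
    zeros = [int(np.sum(dropout(x, training=True).numpy() == 0.0))
             for _ in range(10000)]
    print(np.mean(zeros))  # ~2.0  (n * p = 20 * 0.1)
    print(np.std(zeros))   # ~1.34 (sqrt(n * p * (1 - p)))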

    OP asked: Thank you, I got you, but what about the other values? Why did the non-zeroed values change?

    EDIT: Given how the function works, the expected value of each entry after applying dropout has to equal its value before applying it. So if you run this code for, say, 1000 iterations, the average over runs should tend back to the original tensor (an overall mean of 10.5, or 210 in total). The only way to achieve this is to scale the surviving values up by 1/(1 - rate): with rate = 0.1 each kept value is divided by 0.9, which is why 1 becomes 1.1111112, and so on. In the worst case you drop the three largest numbers and in the best case the three smallest, and the average of those two averages is again 10.5, the initial mean. This interpretation is from the paper I linked (p. 1933).
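    As a quick sanity check (a sketch, assuming the same tensor and rate = 0.1, so the keep probability is 0.9), the surviving entries are exactly the originals divided by 1 - rate:

    import numpy as np
    import tensorflow as tf

    x = tf.reshape(tf.range(1, 21, dtype=tf.float32), (5, 4))
    dropout = tf.keras.layers.Dropout(0.1)

    d = dropout(x, training=True).numpy()
    kept = d != 0.0  # mask of the entries that survived this run
    # Kept entries are divided by (1 - rate) = 0.9, so the expected
    # value per entry is 0.9 * (x / 0.9) + 0.1 * 0 = x
    print(np.allclose(d[kept], x.numpy()[kept] / 0.9))  # True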