I can't understand why dropout works like this in TensorFlow. The CS231n notes say that "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise."
You can also see this in the picture below (taken from the same site).
The TensorFlow documentation, however, says: "With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0."
Now, why is the input element scaled up by 1/keep_prob? Why not just keep the input element as it is with probability keep_prob, without scaling it by 1/keep_prob?
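To make the behaviour being asked about concrete, here is a minimal NumPy sketch of what the documentation describes (an illustration of the op's documented behaviour, not TensorFlow's actual implementation); the function name `dropout_like_tf` is made up for this example:

```python
import numpy as np

def dropout_like_tf(x, keep_prob, rng=np.random.default_rng(0)):
    # Keep each element with probability keep_prob and scale it up by
    # 1 / keep_prob; otherwise output 0 (the behaviour the docs describe).
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

x = np.ones((2, 4), dtype=np.float32)
print(dropout_like_tf(x, keep_prob=0.5))
# Kept elements become 2.0 (= 1 / 0.5), dropped elements become 0.0 --
# this scaling is what the question is asking about.
```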
This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:
The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.
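As a rough NumPy sketch of the paper's formulation (the matrix `W`, input `x`, and keep probability `p` are hypothetical, chosen only to illustrate the idea): dropout is applied without any scaling during training, and the weights are scaled down by p at test time.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                             # probability of retaining a unit
W = rng.normal(size=(4, 3))         # hypothetical outgoing weights
x = rng.normal(size=(4,))           # hypothetical layer activations

# Training time (paper's formulation): drop units, no scaling.
mask = rng.random(x.shape) < p
train_out = (x * mask) @ W

# Test time (paper's formulation): keep all units, but multiply the
# outgoing weights by p so the expected pre-activation matches training.
test_out = x @ (W * p)
```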
Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op that scales up the retained activations by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler, because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network.
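Here is a small sketch of that same-graph idea in NumPy (not TensorFlow's actual code; the function name and shapes are made up for illustration). With inverted dropout, the scaling happens at training time, so evaluation needs no special handling beyond feeding keep_prob == 1.0:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_with_inverted_dropout(x, W, keep_prob):
    # One code path for both phases: pass keep_prob < 1.0 for training
    # and keep_prob == 1.0 for evaluation.
    if keep_prob < 1.0:
        mask = rng.random(x.shape) < keep_prob
        x = np.where(mask, x / keep_prob, 0.0)   # scale up at training time
    return x @ W                                 # no rescaling at test time

W = rng.normal(size=(4, 3))
x = rng.normal(size=(4,))
train_out = layer_with_inverted_dropout(x, W, keep_prob=0.5)  # training
eval_out  = layer_with_inverted_dropout(x, W, keep_prob=1.0)  # evaluation
```

Either way, the expected value of each activation is the same at training and test time; inverted dropout just moves the correction into the training pass so the evaluation graph stays untouched.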