This is from one of the TensorFlow examples, mnist_softmax.py.
Even though the gradients are non-zero, I would expect them to be identical across classes, so all ten weight vectors corresponding to the ten classes should stay exactly the same and produce the same output logits, and hence the same probabilities. The only way I could see this working is if we are getting lucky when computing the accuracy with tf.argmax(), whose output is ambiguous in the case of ties, and that luck results in the 92% accuracy. But then I checked the values of y after training is complete, and they give clearly different outputs for the different classes, which indicates that the weight vectors of the classes are not all the same. Can someone explain how this is possible?
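For context, the part of the example I am asking about looks roughly like this (TF1-style, paraphrased rather than verbatim); the zero initialization of W and b is what puzzles me:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # flattened MNIST images
W = tf.Variable(tf.zeros([784, 10]))          # weights initialized to zeros
b = tf.Variable(tf.zeros([10]))               # biases initialized to zeros
y = tf.matmul(x, W) + b                       # logits for the 10 classes

y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot correct labels
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
```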
Although it is best to initialize the parameters to small random numbers to break symmetry and possibly accelerate learning, it does not follow that you will get the same probabilities for all classes if you initialize the weights to zeros.
The reason is that the cross_entropy function depends on the weights, the inputs, and the correct class labels. So the gradient is different for each output 'neuron', depending on whether its class is the correct label for a given example, and this breaks the symmetry.
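A quick way to see this is to work out the first gradient step by hand. Below is a minimal NumPy sketch (not the TensorFlow example itself; the shapes follow the mnist_softmax.py convention, but the batch and labels are made up for illustration) showing that, even with W and b all zeros, the columns of the gradient of the cross-entropy with respect to W are not identical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 784))                  # toy "batch" standing in for MNIST images
labels = np.array([3, 0, 3, 7])           # toy correct classes
y_true = np.eye(10)[labels]               # one-hot targets

W = np.zeros((784, 10))
b = np.zeros(10)

logits = x @ W + b                        # all zeros -> identical logits per class
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs[0])                           # uniform: 0.1 for every class

# Gradient of the mean cross-entropy w.r.t. the logits is (probs - y_true) / batch:
# about -0.9/batch for the correct class and +0.1/batch for the others,
# so the columns of dW are NOT identical even though W started at zero.
dlogits = (probs - y_true) / x.shape[0]
dW = x.T @ dlogits
print(np.allclose(dW[:, 0], dW[:, 3]))    # False: the symmetry is already broken
```

After this first update the ten weight vectors are no longer equal, so the logits and probabilities for the ten classes diverge on every subsequent step.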