deep-learning · neural-network · sigmoid

Neural network with sigmoid neurons does not learn if a constant is added to all weights and biases after initialization


I'm experimenting with the neural network for handwriting recognition that can be found here: https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py. If the weights and biases are randomly initialized, it recognizes over 80% of the digits after a few epochs. If I add a small constant of 0.27 to all weights and biases after initialization, learning is much slower, but it eventually reaches the same accuracy of over 80%:

# shift every randomly initialized bias and weight by +0.27
self.biases = [np.random.randn(y, 1) + 0.27 for y in sizes[1:]]
self.weights = [np.random.randn(y, x) + 0.27 for x, y in zip(sizes[:-1], sizes[1:])]

Epoch 0 : 205 / 2000
Epoch 1 : 205 / 2000
Epoch 2 : 205 / 2000
Epoch 3 : 219 / 2000
Epoch 4 : 217 / 2000
...
Epoch 95 : 1699 / 2000
Epoch 96 : 1706 / 2000
Epoch 97 : 1711 / 2000
Epoch 98 : 1708 / 2000
Epoch 99 : 1730 / 2000

If I instead add a small constant of 0.28 to all weights and biases after initialization, the network does not learn at all anymore.

# shift every randomly initialized bias and weight by +0.28
self.biases = [np.random.randn(y, 1) + 0.28 for y in sizes[1:]]
self.weights = [np.random.randn(y, x) + 0.28 for x, y in zip(sizes[:-1], sizes[1:])]

Epoch 0 : 207 / 2000
Epoch 1 : 209 / 2000
Epoch 2 : 209 / 2000
Epoch 3 : 209 / 2000
Epoch 4 : 209 / 2000
...
Epoch 145 : 234 / 2000
Epoch 146 : 234 / 2000
Epoch 147 : 429 / 2000
Epoch 148 : 234 / 2000
Epoch 149 : 234 / 2000

I think this has to do with the sigmoid function, which gets very flat when its output is close to zero or one. But what happens at this point when the mean of the weights and biases is 0.28? Why is there such a steep drop in the number of recognized digits? And why are there outliers like the 429 above?
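
For reference, that flatness is easy to check numerically: the sigmoid's derivative σ'(z) = σ(z)(1 − σ(z)) peaks at 0.25 for z = 0 and shrinks rapidly as |z| grows. The sketch below is only illustrative; it uses a made-up stand-in for an MNIST digit and the 784-pixel input size that network.py is typically used with, and shows how a constant shift of the weights pushes a neuron's pre-activation deep into that flat region:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); its maximum is 0.25 at z = 0
    return sigmoid(z) * (1 - sigmoid(z))

rng = np.random.default_rng(0)
n_in = 784  # input size of the MNIST network that network.py is used with

# A rough stand-in for a digit image: mostly black pixels, ~150 bright ones.
x = np.zeros(n_in)
x[rng.choice(n_in, 150, replace=False)] = rng.uniform(0.5, 1.0, 150)

for shift in (0.0, 0.27, 0.28):
    w = rng.standard_normal(n_in) + shift  # weights shifted as in the question
    b = rng.standard_normal() + shift
    z = w @ x + b  # pre-activation of a single hidden neuron
    print(f"shift={shift:.2f}  z={z:7.1f}  sigmoid'(z)={sigmoid_prime(z):.1e}")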


Solution

  • Initialization plays a big role in training neural networks. A good initialization can make training and convergence a lot faster, while a bad one can make it many times slower, and it can even determine whether the network converges at all.

    You might want to read this for more information on the topic:
    https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79

    By adding 0.27 to all weights and biases you probably shift the network away from the optimal solution and increase the gradients. Depending on the layer count this can lead to exploding gradients, so the weights receive very large updates in every iteration. What could be happening is that some weight ends up at 0.3 (after adding 0.27 to it), while its optimal value would be, say, 0.1. Now you get an update of -0.4 and land at -0.1; the next update might be +0.4 (or something close) and you are back at the original problem. So instead of moving slowly towards the optimal value, the optimization just overshoots and bounces back and forth (see the small numeric sketch after this answer). This might settle after some time, or it can lead to no convergence at all because the network just keeps bouncing around.

    Also, in general you want biases to be initialized to 0 or very close to it. If you experiment further, try not adding the constant to the biases and instead set them to 0 (or something close to 0) initially; maybe then the network can actually learn again (a sketch of that change follows below).
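
    The bouncing described above can be reproduced with a tiny, purely illustrative one-dimensional example; the quadratic loss, the learning rate, and the "optimal" value 0.1 are made up and only mirror the numbers used in the answer:

    # Toy illustration of the overshooting: gradient descent on the simple
    # quadratic loss (w - 0.1)^2 with a step size that is too large for it.
    w = 0.3          # weight after the +0.27 shift
    optimum = 0.1    # assumed optimal value
    lr = 1.0         # overly large effective learning rate

    for step in range(6):
        grad = 2 * (w - optimum)  # gradient of (w - optimum)^2
        w = w - lr * grad         # the update overshoots the optimum and flips to the other side
        print(f"step {step}: w = {w:+.2f}")

    With these numbers the weight jumps from 0.3 to -0.1 and back again on every step, never settling at 0.1, which is exactly the back-and-forth behaviour described above.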
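
    The bias suggestion could look like this in network.py's __init__ (a sketch of the experiment, not code from the original repository): keep the +0.28 shift on the weights but start every bias at exactly zero.

    # keep the +0.28 shift on the weights, but initialize all biases to zero
    self.biases = [np.zeros((y, 1)) for y in sizes[1:]]
    self.weights = [np.random.randn(y, x) + 0.28
                    for x, y in zip(sizes[:-1], sizes[1:])]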