I'm attempting to create a multilayer feedforward backpropagation neural network to recognize handwritten digits and I'm running into a problem where the activations in my output layer all tend towards the same value.
I'm using the Optical Recognition of Handwritten Digits Data Set, with training data that looks like
which represents an 8x8 matrix, where each of the 64 integers is the number of dark pixels in a 4x4 block of the original image, followed by one final integer that is the classification.
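To make the row layout concrete, here is a minimal sketch of splitting one row into features and a label. The helper name `parse_row` and the sample row are made up for illustration, and dividing by 16 (the largest possible count in a 4x4 block) is an optional assumption of mine, not something my program does.

```python
import numpy as np

def parse_row(row):
    """Split one 65-value row into 64 features and an integer label."""
    values = np.asarray(row, dtype=float)
    assert values.shape == (65,)
    # Assumption: scale counts (0..16 per 4x4 block) into the range 0..1.
    features = values[:64] / 16.0
    label = int(values[64])
    return features, label

# Made-up row for illustration only; it is not taken from the data set.
row = [i % 17 for i in range(64)] + [7]
features, label = parse_row(row)
image = features.reshape(8, 8)  # view the 64 features as the 8x8 matrix
```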
I'm using 64 nodes in the input layer corresponding to the 64 integers, some number of hidden nodes in some number of hidden layers, and 10 nodes in the output layer corresponding to 0-9.
My weights are initialized here, and biases are added for the input layer and hidden layers
self.weights = []
for i in xrange(1, len(layers) - 1):
size=(layers[i-1] + 1, layers[i] + 1)))
# Output weights
size=(layers[-2] + 1, layers[-1])))
where the list layers contains the number of nodes in each layer, e.g.
layers=[64, 30, 10]
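For reference, a self-contained sketch that produces weight matrices with the shapes described above. The function name `init_weights`, the normal distribution, its `scale` and the fixed seed are assumptions for illustration; only the `size` arguments come from my code.

```python
import numpy as np

def init_weights(layers, scale=0.1, seed=0):
    """Build one weight matrix per layer transition, with a +1 row for the bias."""
    rng = np.random.RandomState(seed)
    weights = []
    # Input and hidden layers: extra row for the bias, extra column so the
    # next layer's input again has a bias slot.
    for i in range(1, len(layers) - 1):
        weights.append(rng.normal(loc=0.0, scale=scale,
                                  size=(layers[i - 1] + 1, layers[i] + 1)))
    # Output weights: no extra column on the output side.
    weights.append(rng.normal(loc=0.0, scale=scale,
                              size=(layers[-2] + 1, layers[-1])))
    return weights

weights = init_weights([64, 30, 10])
shapes = [w.shape for w in weights]
```

With layers=[64, 30, 10] this gives one 65x31 matrix and one 31x10 matrix.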
I'm using the logistic function as my activation function
def logistic(self, z):
    return sp.expit(z)
and its derivative
def derivative(self, z):
    return sp.expit(z) * (1 - sp.expit(z))
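A small standalone sketch of the same two functions, written with plain NumPy instead of sp.expit so it runs without SciPy. The names `derivative_from_z` and `derivative_from_activation` are mine. It shows that the derivative can be computed either from the weighted sum z or from an already computed activation a = logistic(z); which form a caller needs depends on which of the two values it is holding.

```python
import numpy as np

def logistic(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def derivative_from_z(z):
    """Derivative of the logistic function, given the weighted sum z."""
    s = logistic(z)
    return s * (1.0 - s)

def derivative_from_activation(a):
    """Same derivative, given a = logistic(z) instead of z."""
    return a * (1.0 - a)

z = np.array([-2.0, 0.0, 2.0])
a = logistic(z)
from_z = derivative_from_z(z)
from_a = derivative_from_activation(a)
```

Both forms agree when each is given the value it expects, and the derivative peaks at 0.25 for z = 0.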
My backpropagation algorithm is borrowed heavily from here; my previous attempts failed so I wanted to try another route.
def back_prop_learning(self, X, y):
    # add biases to inputs with value of 1
    biases = np.atleast_2d(np.ones(X.shape[0]))
    X = np.concatenate((biases.T, X), axis=1)
    # Iterate over training set
    for epoch in xrange(self.epochs):
        # for each weight w[i][j] in network assign random tiny values
        # handled in __init__
        for example in zip(X, y):
            # for each node i in the input layer
            # set input layer outputs equal to input vector outputs
            activations = [example[0]]
            # for layer = 1 (first hidden) to output layer
            for layer in xrange(len(self.weights)):
                # for each node j in layer
                weighted_sum = np.dot(activations[layer], self.weights[layer])
                # assert number of outputs == number of weights in each layer
                assert(len(activations[layer]) == len(self.weights[layer]))
                # compute activation of weighted sum of node j
                activation = self.logistic(weighted_sum)
                # append vector of activations
                activations.append(activation)
            # for each node j in the output layer
            # compute error of target - output
            errors = example[1] - activations[-1]
            # multiply by derivative
            deltas = [errors * self.derivative(activations[-1])]
            # for layer = last hidden layer down to first hidden layer
            for layer in xrange(len(activations)-2, 0, -1):
                deltas.append(deltas[-1].dot(self.weights[layer].T) * self.derivative(activations[layer]))
            # deltas were collected output-first; reverse to line up with self.weights
            deltas.reverse()
            # for each weight w[i][j] in network
            for i in xrange(len(self.weights)):
                layer = np.atleast_2d(activations[i])
                delta = np.atleast_2d(deltas[i])
                self.weights[i] += self.alpha * layer.T.dot(delta)
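To check the shapes in the forward half of the loop above, here is a standalone sketch of just the forward pass for layers=[64, 30, 10]. The function name `forward`, the random weights and the random input are assumptions for illustration. Note that with these weight shapes the extra hidden column is an ordinary sigmoid unit rather than a constant 1.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Return the list of activations, input (with bias) first, output last."""
    activations = [np.concatenate(([1.0], x))]  # prepend the bias input of 1
    for w in weights:
        # Each layer's input length must match the rows of its weight matrix.
        assert len(activations[-1]) == len(w)
        activations.append(logistic(np.dot(activations[-1], w)))
    return activations

rng = np.random.RandomState(0)
weights = [rng.normal(0.0, 0.1, size=(65, 31)),   # input (+bias) -> hidden (+1)
           rng.normal(0.0, 0.1, size=(31, 10))]   # hidden (+1) -> output
x = rng.uniform(0.0, 1.0, size=64)                # made-up input vector
activations = forward(x, weights)
outputs = activations[-1]
```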
And my outputs after running testing data all resemble
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 9.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 4.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 6.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 6.0
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 7.0
No matter what I select for my learning rate, number of hidden nodes, or number of hidden layers, every output tends towards 1. This leaves me wondering whether I'm setting up the problem correctly (64 inputs to 10 outputs), whether I've selected and implemented my sigmoid function correctly, or whether the failure is in my implementation of the backpropagation algorithm. I've rewritten the program two or three times with the same results, which leads me to believe that I'm fundamentally misunderstanding the problem and not representing it correctly.
I think I've answered my question.
I believe the problem was how I was calculating the errors in the output layer. I had been calculating them as errors = example[1] - activations[-1], which subtracts every output activation from the raw class label (a single number from 0 to 9). Since the activations lie between 0 and 1, every error is positive for any label of 1 or more, which would push all ten outputs towards 1.
I changed this so that the target is a vector of ten zeros (indices 0-9) with 1.0 at the index of the target digit:
y = int(example[1])
errors_v = np.zeros(shape=(10,), dtype=float)
errors_v[y] = 1.0
errors = errors_v - activations[-1]
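A standalone comparison of the two error calculations may make the difference easier to see. The helper name `one_hot` and the flat 0.5 activations are made up for illustration.

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Vector of zeros with 1.0 at the index of the class label."""
    target = np.zeros(num_classes, dtype=float)
    target[int(label)] = 1.0
    return target

outputs = np.full(10, 0.5)   # made-up output activations
label = 9.0                  # class label as stored in the data

# Old calculation: the scalar label is broadcast against all ten outputs,
# so every error has the same (positive) sign.
scalar_errors = label - outputs

# New calculation: only the target index is asked to go up,
# the other nine outputs are asked to go down.
one_hot_errors = one_hot(label) - outputs
```

Here scalar_errors is 8.5 in every position, while one_hot_errors is 0.5 at index 9 and -0.5 everywhere else.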
I also changed my activation function to be the tanh function.
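For completeness, a sketch of what a tanh replacement for logistic and derivative could look like. The function names are mine and this is not copied from my program. The derivative takes the weighted sum z as its argument.

```python
import numpy as np

def tanh(z):
    """Hyperbolic tangent activation; outputs lie between -1 and 1."""
    return np.tanh(z)

def tanh_derivative(z):
    """Derivative of tanh with respect to z: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

z = np.array([-1.0, 0.0, 1.0])
slopes = tanh_derivative(z)
```

One difference from the logistic function is that the slope at z = 0 is 1.0 rather than 0.25.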
This has significantly increased the variance in the activations in my output layer and I've been able to achieve 50% - 75% accuracy in my limited testing so far. Hopefully this helps someone else.
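In case it is useful, this is roughly how accuracy can be measured from the output activations: take the index of the largest output as the predicted digit and compare it with the label. The function name `accuracy` and the tiny three-class example are made up for illustration.

```python
import numpy as np

def accuracy(outputs, labels):
    """Fraction of rows whose largest output index equals the label."""
    predictions = np.argmax(outputs, axis=1)
    return float(np.mean(predictions == np.asarray(labels).astype(int)))

# Made-up activations for four examples and three classes.
outputs = np.array([[0.1, 0.9, 0.0],
                    [0.8, 0.1, 0.1],
                    [0.2, 0.3, 0.5],
                    [0.6, 0.3, 0.1]])
labels = [1.0, 0.0, 2.0, 2.0]
score = accuracy(outputs, labels)  # three of the four predictions match
```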