So, I'm hoping I'm doing something really dumb here and there's an easy answer. I'm trying to train a 2x3x1 neural network on the XOR problem. It wasn't working, so I decided to dig in and see what was happening. Finally, I decided to assign the weights myself. These are the weights I came up with:
theta1 = [11 0 -5; 0 12 -7;18 17 -20];
theta2 = [14 13 -28 -6];
(In Matlab notation.) I deliberately tried to make no two weights the same (barring the zeros).
And my code, which is really simple in Matlab, is:
    function layer2 = xornn(iters)
        if nargin < 1
            iters = 50;
        end
        function s = sigmoid(X)
            s = 1.0 ./ (1.0 + exp(-X));
        end
        % Targets and inputs: one column per example, last row of X is the bias.
        T = [0 1 1 0];
        X = [0 0 1 1; 0 1 0 1; 1 1 1 1];
        theta1 = [11 0 -5; 0 12 -7; 18 17 -20];
        theta2 = [14 13 -28 -6];
        for i = [1:iters]
            % Forward pass, appending a bias row to the hidden layer.
            layer1 = [sigmoid(theta1 * X); 1 1 1 1];
            layer2 = sigmoid(theta2 * layer1);
            delta2 = T - layer2;
            delta1 = layer1 .* (1-layer1) .* (theta2' * delta2);
            % remove the bias from delta 1. There's no real point in a delta on the bias.
            delta1 = delta1(1:3,:);
            theta2d = delta2 * layer1';
            theta1d = delta1 * X';
            theta1 = theta1 - 0.1 * theta1d;
            theta2 = theta2 - 0.1 * theta2d;
        end
    end
I believe that's right. I tested various parameters (of the thetas) with the finite differences method to see if they were right, and they seemed to be.
But when I run it, it eventually boils down to returning all zeros. If I do xornn(1) (for 1 iteration) I get
0.0027 0.9966 0.9904 0.0008
But, if I do xornn(35)
0.0026 0.9949 0.9572 0.0007
(it has started moving in the wrong direction), and by the time I get to xornn(45) I get
0.0018 0.0975 0.0000 0.0003
If I run it for 10,000 iterations, it just returns all 0's.
What is going on? Must I add regularization? I would have thought such a simple network wouldn't need it. But, regardless, why does it move away from an obvious good solution that I have hand fed it?
AAARRGGHHH! The solution was simply a matter of changing
    theta1 = theta1 - 0.1 * theta1d;
    theta2 = theta2 - 0.1 * theta2d;
to
    theta1 = theta1 + 0.1 * theta1d;
    theta2 = theta2 + 0.1 * theta2d;
Now, though, I need to figure out how I'm somehow computing the negative derivative when what I thought I was computing was the ... Never mind. I'll post here anyway, just in case it helps someone else.
So, z is the sum of inputs to the sigmoid, and y is the output of the sigmoid.
    C = -(T * Log[y] + (1-T) * Log[1-y])
    dC/dy = -((T/y) - (1-T)/(1-y))
          = -((T(1-y) - y(1-T)) / (y(1-y)))
          = -((T - Ty - y + Ty) / (y(1-y)))
          = -((T - y) / (y(1-y)))
          = ((y - T) / (y(1-y)))   # This is the source of all my woes.
    dy/dz = y(1-y)
    dC/dz = ((y - T) / (y(1-y))) * y(1-y)
          = (y - T)
So the problem is that I was accidentally computing T-y, because I forgot about the negative sign in front of the cost function. Then I was subtracting what I thought was the gradient but was in fact the negative gradient. And there, that's the problem.
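For anyone wanting to confirm the sign numerically, here is a minimal sketch of a finite-difference check on one entry of theta2, reusing the network from the post. The names hidden, out, cost, E, h, k, and num are made up for illustration, and the cost is assumed to be the cross-entropy from the derivation summed over the four examples. If the derivation holds, the finite difference should come out equal in magnitude and opposite in sign to theta2d.

```matlab
% Hypothetical gradient check; hidden, out, cost, E, h, k, num are
% illustrative names and do not appear in the original function.
sigmoid = @(Z) 1.0 ./ (1.0 + exp(-Z));
T = [0 1 1 0];
X = [0 0 1 1; 0 1 0 1; 1 1 1 1];
theta1 = [11 0 -5; 0 12 -7; 18 17 -20];
theta2 = [14 13 -28 -6];

% Hidden layer with a bias row appended, as in the post.
hidden = @(t1) [sigmoid(t1 * X); ones(1, size(X, 2))];
out = @(t1, t2) sigmoid(t2 * hidden(t1));
% Assumed cost: cross-entropy summed over the four examples.
cost = @(t1, t2) -sum(T .* log(out(t1, t2)) + (1 - T) .* log(1 - out(t1, t2)));

% The quantity the post computes for theta2 (built from T - y).
layer1 = hidden(theta1);
layer2 = sigmoid(theta2 * layer1);
theta2d = (T - layer2) * layer1';

% Central finite difference of the cost with respect to theta2(k).
h = 1e-5;
k = 4;                          % the bias weight of the output unit
E = zeros(size(theta2));
E(k) = h;
num = (cost(theta1, theta2 + E) - cost(theta1, theta2 - E)) / (2 * h);

fprintf('finite difference: %g, theta2d: %g\n', num, theta2d(k));
```

Comparing signs as well as magnitudes is the point of the check: a comparison of absolute values alone would pass with either update rule.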
Once I did that:
    function layer2 = xornn(iters)
        if nargin < 1
            iters = 50;
        end
        function s = sigmoid(X)
            s = 1.0 ./ (1.0 + exp(-X));
        end
        % Targets and inputs: one column per example, last row of X is the bias.
        T = [0 1 1 0];
        X = [0 0 1 1; 0 1 0 1; 1 1 1 1];
        theta1 = [11 0 -5; 0 12 -7; 18 17 -20];
        theta2 = [14 13 -28 -6];
        for i = [1:iters]
            % Forward pass, appending a bias row to the hidden layer.
            layer1 = [sigmoid(theta1 * X); 1 1 1 1];
            layer2 = sigmoid(theta2 * layer1);
            delta2 = T - layer2;
            delta1 = layer1 .* (1-layer1) .* (theta2' * delta2);
            % remove the bias from delta 1. There's no real point in a delta on the bias.
            delta1 = delta1(1:3,:);
            theta2d = delta2 * layer1';
            theta1d = delta1 * X';
            % delta2 is T - y, so theta1d and theta2d are negative gradients: add them.
            theta1 = theta1 + 0.1 * theta1d;
            theta2 = theta2 + 0.1 * theta2d;
        end
    end
xornn(50) returns 0.0028 0.9972 0.9948 0.0009 and xornn(10000) returns 0.0016 0.9989 0.9993 0.0005
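To see the effect of the sign on the cost itself and not only on the outputs, the sketch below runs the same loop with both update signs and records the cross-entropy before each update. The names cost, costs, signs, and s are illustrative, and the cost is assumed to be the summed cross-entropy from the derivation. It is a rough comparison under those assumptions, not a replacement for xornn.

```matlab
% Hypothetical comparison of the two update signs; cost, costs, signs, s
% are illustrative names, not part of the original function.
sigmoid = @(Z) 1.0 ./ (1.0 + exp(-Z));
T = [0 1 1 0];
X = [0 0 1 1; 0 1 0 1; 1 1 1 1];
% Assumed cost: cross-entropy summed over the four examples.
cost = @(y) -sum(T .* log(y) + (1 - T) .* log(1 - y));

costs = zeros(2, 51);           % row 1: minus sign, row 2: plus sign
signs = [-1, 1];
for s = 1:2
    theta1 = [11 0 -5; 0 12 -7; 18 17 -20];
    theta2 = [14 13 -28 -6];
    for i = 1:51
        layer1 = [sigmoid(theta1 * X); 1 1 1 1];
        layer2 = sigmoid(theta2 * layer1);
        costs(s, i) = cost(layer2);     % cost before the i-th update
        delta2 = T - layer2;
        delta1 = layer1 .* (1 - layer1) .* (theta2' * delta2);
        delta1 = delta1(1:3, :);        % drop the bias row
        theta1 = theta1 + signs(s) * 0.1 * (delta1 * X');
        theta2 = theta2 + signs(s) * 0.1 * (delta2 * layer1');
    end
end
fprintf('start %.4f, minus sign %.4f, plus sign %.4f\n', ...
        costs(1, 1), costs(1, end), costs(2, end));
```

With the minus sign the recorded cost should climb away from its starting value, and with the plus sign it should fall, which matches the outputs reported above.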
Phew! Maybe this will help someone else in debugging their version.