So, I'm hoping this is a really dumb thing I'm doing, and there's an easy answer. I'm trying to train a 2x3x1 neural network to do the XOR problem. It wasn't working, so I decided to dig in to see what was happening. Finally, I decided to assign the weights myself. These were the weights I came up with:
theta1 = [11 0 -5; 0 12 -7;18 17 -20];
theta2 = [14 13 -28 -6];
(In MATLAB notation.) I deliberately tried to make sure no two weights were the same (barring the zeros).
And my code, really simple, in MATLAB, is:
function layer2 = xornn(iters)
    if nargin < 1
        iters = 50;
    end
    function s = sigmoid(X)
        s = 1.0 ./ (1.0 + exp(-X));
    end
    T = [0 1 1 0];                          % XOR targets
    X = [0 0 1 1; 0 1 0 1; 1 1 1 1];        % inputs; the third row is the bias
    theta1 = [11 0 -5; 0 12 -7; 18 17 -20];
    theta2 = [14 13 -28 -6];
    for i = 1:iters
        layer1 = [sigmoid(theta1 * X); 1 1 1 1];   % hidden activations plus bias row
        layer2 = sigmoid(theta2 * layer1);         % network output
        delta2 = T - layer2;
        delta1 = layer1 .* (1 - layer1) .* (theta2' * delta2);
        % remove the bias from delta1. There's no real point in a delta on the bias.
        delta1 = delta1(1:3, :);
        theta2d = delta2 * layer1';
        theta1d = delta1 * X';
        theta1 = theta1 - 0.1 * theta1d;
        theta2 = theta2 - 0.1 * theta2d;
    end
end
I believe that's right. I tested various parameters (entries of the thetas) against the finite-differences method to see if the gradients were right, and they seemed to be.
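For reference, here's a minimal sketch of the kind of finite-differences check I mean (a hypothetical reconstruction, not my exact code, and it assumes a cross-entropy cost): perturb one weight, recompute the cost, and compare the numeric slope against the backprop value for that same weight. Agreement in both magnitude and sign is what matters.
function fdcheck()
    sigmoid = @(z) 1.0 ./ (1.0 + exp(-z));
    T = [0 1 1 0];
    X = [0 0 1 1; 0 1 0 1; 1 1 1 1];
    theta1 = [11 0 -5; 0 12 -7; 18 17 -20];
    theta2 = [14 13 -28 -6];
    % cross-entropy cost as a function of theta2 (theta1 held fixed)
    cost = @(th2) -sum(T .* log(sigmoid(th2 * [sigmoid(theta1 * X); 1 1 1 1])) ...
                     + (1 - T) .* log(1 - sigmoid(th2 * [sigmoid(theta1 * X); 1 1 1 1])));
    % backprop value for theta2(1), computed exactly as in the training loop above
    layer1  = [sigmoid(theta1 * X); 1 1 1 1];
    layer2  = sigmoid(theta2 * layer1);
    delta2  = T - layer2;
    theta2d = delta2 * layer1';
    % central-difference estimate of dC/dtheta2(1)
    h  = 1e-5;
    tp = theta2; tp(1) = tp(1) + h;
    tm = theta2; tm(1) = tm(1) - h;
    numeric = (cost(tp) - cost(tm)) / (2 * h);
    fprintf('backprop %.6f   finite difference %.6f\n', theta2d(1), numeric);
end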
But when I run it, the output eventually collapses to all zeros. If I do xornn(1) (for 1 iteration), I get
0.0027 0.9966 0.9904 0.0008
But, if I do xornn(35)
0.0026 0.9949 0.9572 0.0007
(It has started heading in the wrong direction), and by the time I get to xornn(45) I get
0.0018 0.0975 0.0000 0.0003
If I run it for 10,000 iterations, it just returns all 0's.
What is going on? Must I add regularization? I would have thought such a simple network wouldn't need it. But regardless, why does it move away from an obviously good solution that I have hand-fed it?
Thanks!
AAARRGGHHH! The solution was simply a matter of changing
theta1 = theta1 - 0.1 * theta1d;
theta2 = theta2 - 0.1 * theta2d;
to
theta1 = theta1 + 0.1 * theta1d;
theta2 = theta2 + 0.1 * theta2d;
sigh
Now, though, I need to figure out how I was computing the negative derivative when what I thought I was computing was the ... never mind. I'll post the derivation here anyway, just in case it helps someone else.
So, z is the sum of inputs to the sigmoid, and y is the output of the sigmoid.
C = -(T * log(y) + (1-T) * log(1-y))
dC/dy = -((T/y) - (1-T)/(1-y))
= -((T(1-y)-y(1-T))/(y(1-y)))
= -((T-Ty-y+Ty)/(y(1-y)))
= -((T-y)/(y(1-y)))
= ((y-T)/(y(1-y))) # This is the source of all my woes.
dy/dz = y(1-y)
dC/dz = ((y-T)/(y(1-y))) * y(1-y)
= (y-T)
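A quick way to convince yourself of that last line is to check it numerically for one scalar case (arbitrary values here, just to confirm the algebra):
sig = @(z) 1 ./ (1 + exp(-z));
C   = @(z, t) -(t * log(sig(z)) + (1 - t) * log(1 - sig(z)));
t = 1; z = 0.3; h = 1e-6;
numeric  = (C(z + h, t) - C(z - h, t)) / (2 * h);   % dC/dz by central differences
analytic = sig(z) - t;                              % y - T, from the last line above
fprintf('numeric %.6f   analytic %.6f\n', numeric, analytic)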
So, the problem is that I was accidentally computing T-y, because I forgot about the negative sign in front of the cost function. Then I was subtracting what I thought was the gradient, but which was in fact the negative gradient. And there it is. That's the problem.
Once I did that:
function layer2 = xornn(iters)
    if nargin < 1
        iters = 50;
    end
    function s = sigmoid(X)
        s = 1.0 ./ (1.0 + exp(-X));
    end
    T = [0 1 1 0];                          % XOR targets
    X = [0 0 1 1; 0 1 0 1; 1 1 1 1];        % inputs; the third row is the bias
    theta1 = [11 0 -5; 0 12 -7; 18 17 -20];
    theta2 = [14 13 -28 -6];
    for i = 1:iters
        layer1 = [sigmoid(theta1 * X); 1 1 1 1];   % hidden activations plus bias row
        layer2 = sigmoid(theta2 * layer1);         % network output
        delta2 = T - layer2;                       % this is -(dC/dz): the negative gradient
        delta1 = layer1 .* (1 - layer1) .* (theta2' * delta2);
        % remove the bias from delta1. There's no real point in a delta on the bias.
        delta1 = delta1(1:3, :);
        theta2d = delta2 * layer1';
        theta1d = delta1 * X';
        theta1 = theta1 + 0.1 * theta1d;           % add, because theta1d is the negative gradient
        theta2 = theta2 + 0.1 * theta2d;
    end
end
xornn(50) returns 0.0028 0.9972 0.9948 0.0009 and xornn(10000) returns 0.0016 0.9989 0.9993 0.0005
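For anyone reading later: an equivalent, maybe more conventional, way to write the same fix is to flip the sign of the deltas instead of the update, so the code literally subtracts the gradient. Replacing the inside of the for loop with this (same variables as above) gives the same result:
layer1 = [sigmoid(theta1 * X); 1 1 1 1];
layer2 = sigmoid(theta2 * layer1);
delta2 = layer2 - T;                                    % = dC/dz, the actual gradient at the output
delta1 = layer1 .* (1 - layer1) .* (theta2' * delta2);
delta1 = delta1(1:3, :);                                % drop the bias row again
theta1 = theta1 - 0.1 * (delta1 * X');                  % plain gradient descent
theta2 = theta2 - 0.1 * (delta2 * layer1');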
Phew! Maybe this will help someone else in debugging their version.