
Backpropagation implementation not converging with XOR Dataset


I have looked everywhere, stayed up a few nights, and studied many different back-propagation implementations (on Stack Overflow as well) for this exact problem, but I can't seem to understand how they work.

I am currently enrolled in Andrew Ng's Coursera machine learning course, and it's great, but the back-prop implementation shown in that course is very different from what I'm seeing around the internet.

I'm having problems understanding the dimensions and computing the deltas for each weight. I would really appreciate it if someone could give me a rundown of exactly what happens in back-propagation. I do not have a problem with forward prop.

Here's my code (skip to the first for loop).

import numpy as np
import sys

x_train = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0]
])
y_train = np.array([
    [1],
    [1],
    [0],
    [0]
])

learning_rate = 0.03
reg_param = 0.5
num_h_units = 5
max_iter = 60000 # for gradient descent
m = 4 # training

np.random.seed(1)
weights1 = np.random.random((x_train.shape[1], num_h_units)) # 3x5 (Including bias)
weights2 = np.random.random((num_h_units + 1, 1)) # 6x1 (Including bias)

def sigmoid(z, derv=False):
    if derv: return z * (1 - z)
    return (1 / (1 + np.exp(-z)))

def forward(x, predict=False):
    a1 = x # 1x3
    a1.shape = (1, a1.shape[0]) # Reshaping now, to avoid reshaping the other activations.
    a2 = np.insert(sigmoid(a1.dot(weights1)), 0, 1, axis=1) # 1x3 * 3x5 = 1x5 + bias = 1x6
    a3 = sigmoid(a2.dot(weights2)) # 1x6 * 6x1 = 1x1

    if predict: return a3
    return (a1, a2, a3)


w_grad1 = 0
w_grad2 = 0
for i in range(max_iter):
    for j in range(m):
        sys.stdout.write("\rIteration: {} and {}".format(i + 1, j + 1))
        a1, a2, a3 = forward(x_train[j])

        delta3 = np.multiply((a3 - y_train[j]), sigmoid(a3, derv=True)) # 1x1

        # (1x6 * 1x1) .* 1x6 = 1x6 (Here, ".*" stands for element wise mult)
        delta2 = np.multiply((weights2.T * delta3), sigmoid(a2, derv=True))
        delta2 = delta2[:, 1:] # Getting rid of the bias value since that shouldn't be updated.

        # 3x1 * 1x5 = 3x5 (Gradient of all the weight values for weights connecting input to hidden)
        w_grad1 += (1 / m) * a1.T.dot(delta2)

        # 6x1 * 1x1 = 6x1 (Updating the bias as well. If bias is removed, dimensions don't match)
        a2[:, 0] = 0
        w_grad2 += (1 / m) * a2.T.dot(delta3)
        sys.stdout.flush() # Updating the text.
    weights1 -= learning_rate * w_grad1
    weights2 -= learning_rate * w_grad2


# Outputting all the outputs at once.
a1_full = x_train
a2_full = np.insert(sigmoid(a1_full.dot(weights1)), 0, 1, axis=1)
a3_full = sigmoid(a2_full.dot(weights2))
print(a3_full)

Here's the output I'm getting: [output screenshot]

I also don't understand the following:

  1. In the Coursera course, delta3 is computed by just doing a3 - target, but in other places I've seen delta3 calculated as (a3 - target) * sigmoid(a3, derv=True). I'm confused: which one is correct, and why?
  2. In many implementations, the author did not use learning_rate and (1 / m) to scale the gradient. Are learning_rate and (1 / m) optional?
  3. What are we supposed to do with the biases? Update them? Not update them? In many other implementations, I've seen people just updating the biases as well.
  4. Is there a set position where the biases should go, like the first column or the last column, etc.?
  5. Do I need to do np.insert() to add the bias column to the computation?

I am extremely lost on this, so thank you in advance. I thought I understood back-propagation, but implementing it has been an absolute nightmare.


Solution

  • This is the best intuitive explanation of backprop that I know of. Highly recommended.

    1. What loss function are you using? If you use cross-entropy loss (the one with the log in it), then delta3 has to be just (a3 - target). For least-squares loss, the other one is correct. Use only (a3 - y_train[j]) in your code.
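
    A quick way to convince yourself (my own numerical check, not part of the original answer): compare finite-difference gradients of each loss against the two closed-form deltas. For a sigmoid output, the cross-entropy gradient with respect to the pre-activation is exactly a - y, while the squared-error gradient carries the extra sigmoid-derivative factor.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z, y = 0.7, 1.0  # arbitrary pre-activation and target
    a = sigmoid(z)
    eps = 1e-6

    ce = lambda z: -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))  # cross-entropy
    se = lambda z: 0.5 * (sigmoid(z) - y) ** 2                                   # least squares

    # Each print shows a finite-difference estimate next to the closed form; they agree.
    print((ce(z + eps) - ce(z - eps)) / (2 * eps), a - y)
    print((se(z + eps) - se(z - eps)) / (2 * eps), (a - y) * a * (1 - a))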

    2. No, the learning rate and 1/m are not optional.

    3. The biases should always be updated.

    4. Try to initialize the biases and weights separately. I find it much easier to understand.

    Example Forward pass:

    Z1 = Weights*X + biases

    A1 = sigmoid(Z1)
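
    For instance, here is a minimal numpy sketch of that layout (the 2-input, one-example-per-column convention and the unit-variance scale are my own choices, not from the question):

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_hidden, n_out = 2, 5, 1  # XOR has 2 inputs; 5 hidden units as in the question

    W1 = rng.normal(0.0, 1.0, (n_hidden, n_in))  # weights only, no bias column
    b1 = np.zeros((n_hidden, 1))                 # biases kept separate
    W2 = rng.normal(0.0, 1.0, (n_out, n_hidden))
    b2 = np.zeros((n_out, 1))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]]).T  # 2x4, one example per column

    Z1 = W1 @ X + b1   # 5x4
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2  # 1x4
    A2 = sigmoid(Z2)   # one prediction per example

    With this layout no np.insert() is needed, which also answers question 5: the bias column is only required when you fold the biases into the weight matrix.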

    Refer to this notebook. I have implemented the exact same thing using just numpy, and it works.

    Corrections:

    delta3 = a3 - y_train[j]

    delta2 = np.multiply((weights2[1:, :].T * delta3), sigmoid_prime(z1)) # weights2[1:, :] skips the bias row so the shapes match (1x5)

    where sigmoid_prime is:

    def sigmoid_prime(z):
        return sigmoid(z)*(1-sigmoid(z))
    

    and z1 is a1.dot(weights1). Your feed-forward function needs to return this value as well so that you can use it here.
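
    For example, the question's forward function could be adapted to return z1 like this (a sketch that keeps the asker's shapes and bias-column convention):

    def forward(x, predict=False):
        a1 = x.reshape(1, -1)                      # 1x3 row vector (bias already in x)
        z1 = a1.dot(weights1)                      # 1x5 pre-activations, kept for sigmoid_prime
        a2 = np.insert(sigmoid(z1), 0, 1, axis=1)  # 1x6 after prepending the bias unit
        a3 = sigmoid(a2.dot(weights2))             # 1x1 output
        if predict: return a3
        return (a1, z1, a2, a3)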

    Also, since you are using stochastic gradient descent (and not mini-batch gradient descent), your m is actually 1 here, so you should remove the 1/m term.
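
    Putting these corrections together, the inner loop could look like this (my sketch, reusing the question's variables plus the forward and sigmoid_prime above; with true SGD the weights are updated after every example and no 1/m appears):

    for i in range(max_iter):
        for j in range(m):
            a1, z1, a2, a3 = forward(x_train[j])

            delta3 = a3 - y_train[j]                                   # 1x1 cross-entropy delta
            delta2 = (weights2[1:, :].T * delta3) * sigmoid_prime(z1)  # 1x5, bias row of weights2 skipped

            weights1 -= learning_rate * a1.T.dot(delta2)               # 3x5 update
            weights2 -= learning_rate * a2.T.dot(delta3)               # 6x1 update, bias weight included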

    Initialize the weights using np.random.normal and not np.random.random.
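
    np.random.random samples uniformly from [0, 1), so every initial weight is positive; np.random.normal gives zero-centered weights. For example (keeping the question's shapes; the scale of 1.0 is an arbitrary choice):

    weights1 = np.random.normal(0, 1, (x_train.shape[1], num_h_units))  # 3x5
    weights2 = np.random.normal(0, 1, (num_h_units + 1, 1))             # 6x1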

    Do not get rid of the bias terms.

    Read up on back-prop at the link above and also here.