I have looked everywhere, stayed up a few nights, and read many different back-propagation implementations (on Stack Overflow as well) for this exact problem, but I can't seem to understand how they work.
I am currently enrolled in Andrew Ng's Coursera machine learning course, and it's great, but the back-prop implementation shown in that course is very different from what I'm seeing around the internet.
I'm having problems understanding the dimensions and computing the deltas for each weight. I would really appreciate it if someone could give me a rundown of exactly what happens in back-propagation. I do not have a problem with forward prop.
Here's my code (skip to the first for loop).
import numpy as np
import sys
x_train = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0]
])
y_train = np.array([
    [1],
    [1],
    [0],
    [0]
])
learning_rate = 0.03
reg_param = 0.5
num_h_units = 5
max_iter = 60000 # for gradient descent
m = 4 # training
np.random.seed(1)
weights1 = np.random.random((x_train.shape[1], num_h_units)) # 3x5 (Including bias)
weights2 = np.random.random((num_h_units + 1, 1)) # 6x1 (Including bias)
def sigmoid(z, derv=False):
    if derv: return z * (1 - z)
    return (1 / (1 + np.exp(-z)))
def forward(x, predict=False):
    a1 = x # 1x3
    a1.shape = (1, a1.shape[0]) # Reshaping now, to avoid reshaping the other activations.
    a2 = np.insert(sigmoid(a1.dot(weights1)), 0, 1, axis=1) # 1x3 * 3x5 = 1x5 + bias = 1x6
    a3 = sigmoid(a2.dot(weights2)) # 1x6 * 6x1 = 1x1
    if predict: return a3
    return (a1, a2, a3)
w_grad1 = 0
w_grad2 = 0
for i in range(max_iter):
    for j in range(m):
        sys.stdout.write("\rIteration: {} and {}".format(i + 1, j + 1))
        a1, a2, a3 = forward(x_train[j])
        delta3 = np.multiply((a3 - y_train[j]), sigmoid(a3, derv=True)) # 1x1
        # (1x6 * 1x1) .* 1x6 = 1x6 (Here, ".*" stands for element wise mult)
        delta2 = np.multiply((weights2.T * delta3), sigmoid(a2, derv=True))
        delta2 = delta2[:, 1:] # Getting rid of the bias value since that shouldn't be updated.
        # 3x1 * 1x5 = 3x5 (Gradient of all the weight values for weights connecting input to hidden)
        w_grad1 += (1 / m) * a1.T.dot(delta2)
        # 6x1 * 1x1 = 6x1 (Updating the bias as well. If bias is removed, dimensions don't match)
        a2[:, 0] = 0
        w_grad2 += (1 / m) * a2.T.dot(delta3)
        sys.stdout.flush() # Updating the text.
    weights1 -= learning_rate * w_grad1
    weights2 -= learning_rate * w_grad2
# Outputting all the outputs at once.
a1_full = x_train
a2_full = np.insert(sigmoid(a1_full.dot(weights1)), 0, 1, axis=1)
a3_full = sigmoid(a2_full.dot(weights2))
print(a3_full)
Here's the output I'm getting:
I also don't understand the following:
I am extremely lost on this so thank you in advance. I thought I understood back-propagation but implementing it has been an absolute nightmare.
This is the best intuitive explanation of backprop that I know of. Highly recommended.
1. What loss function are you using? If you use cross-entropy loss (the one with log in it), then delta3 has to be just (a3 - target). For least-squares loss, the expression you have now (with the sigmoid derivative) is correct. Use only (a3 - y_train[j]) in your code.
2. No, the learning rate and 1/m are not optional.
3. The biases are always supposed to be updated.
4. Try to initialize the biases and weights separately. I find it much easier to understand.
Example Forward pass:
Z1 = Weights*X + biases
A1 = sigmoid(Z1)
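As a minimal sketch of point 4, assuming the same 5 hidden units and single output as in the question, with the column of ones dropped from the inputs, a forward pass with separate weights and biases might look like the following. The names W1, b1, W2, b2 and the row-vector convention (x @ W rather than W @ x) are illustrative choices, not taken from the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Hypothetical parameter names; the biases are stored apart from the weights.
W1 = rng.normal(size=(2, 5))   # input (2 features) -> hidden (5 units)
b1 = np.zeros((1, 5))
W2 = rng.normal(size=(5, 1))   # hidden (5 units) -> output (1 unit)
b2 = np.zeros((1, 1))

def forward(x):
    """Forward pass for a batch x of shape (n, 2); also returns pre-activations."""
    z1 = x @ W1 + b1       # (n, 5)
    a1 = sigmoid(z1)       # (n, 5)
    z2 = a1 @ W2 + b2      # (n, 1)
    a2 = sigmoid(z2)       # (n, 1)
    return z1, a1, z2, a2

# The question's inputs without the leading column of ones.
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]], dtype=float)
z1, a1, z2, a2 = forward(X)
print(a2.shape)  # (4, 1)
```

Returning z1 and z2 alongside the activations means the backward pass can reuse them instead of recomputing them.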
Refer to this notebook. I have implemented the exact same thing using just numpy, and it works.
Corrections:
delta3 = a3 - y_train[j]
delta2 = np.multiply((weights2.T * delta3), sigmoid_prime(z1))
where sigmoid_prime is:
def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))
and z1 is a1.dot(weights1). Your feed-forward function needs to return this value as well so that you can use it here.
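To make the dimensions concrete, here is a minimal sketch of these corrections for a single training example, checked against a finite-difference gradient. It assumes cross-entropy loss and keeps the question's a1/a2/a3 naming. One deliberate difference: the biases are stored separately (b1 and b2 are names introduced here), because with the bias row folded into weights2 the factor weights2.T would be 1x6 while sigmoid_prime(z1) is 1x5, and the two would not multiply element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, p):
    a1 = x.reshape(1, -1)                  # 1x2 input row
    z1 = a1 @ p["weights1"] + p["b1"]      # 1x5
    a2 = sigmoid(z1)                       # 1x5
    z2 = a2 @ p["weights2"] + p["b2"]      # 1x1
    a3 = sigmoid(z2)                       # 1x1
    return a1, z1, a2, a3

def loss(x, y, p):
    """Cross-entropy loss for one example (assumed loss function)."""
    a3 = forward(x, p)[3]
    return (-(y * np.log(a3) + (1 - y) * np.log(1 - a3))).item()

def backward(x, y, p):
    a1, z1, a2, a3 = forward(x, p)
    delta3 = a3 - y                                          # 1x1
    delta2 = (p["weights2"].T * delta3) * sigmoid_prime(z1)  # (1x5 * 1x1) .* 1x5 = 1x5
    return {
        "weights1": a1.T @ delta2,  # 2x1 @ 1x5 = 2x5
        "b1": delta2,               # 1x5
        "weights2": a2.T @ delta3,  # 5x1 @ 1x1 = 5x1
        "b2": delta3,               # 1x1
    }

rng = np.random.default_rng(0)
params = {
    "weights1": rng.normal(size=(2, 5)), "b1": rng.normal(size=(1, 5)),
    "weights2": rng.normal(size=(5, 1)), "b2": rng.normal(size=(1, 1)),
}
x, y = np.array([0.0, 1.0]), 1.0
grads = backward(x, y, params)

# Central finite differences over every parameter entry.
eps = 1e-6
max_err = 0.0
for name, value in params.items():
    for idx in np.ndindex(value.shape):
        old = value[idx]
        value[idx] = old + eps
        up = loss(x, y, params)
        value[idx] = old - eps
        down = loss(x, y, params)
        value[idx] = old
        max_err = max(max_err, abs((up - down) / (2 * eps) - grads[name][idx]))
print(max_err)  # should be very small if the deltas are right
```

Each gradient has the same shape as the parameter it updates, which is a quick sanity check whenever the dimensions get confusing.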
Also, since you are using stochastic gradient descent (and not mini-batch gradient descent), your m is actually 1 here, so you should remove the 1/m term.
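The difference between the two update schemes can be sketched as follows: a stochastic step applies one example's gradient directly, while a full-batch step averages the m per-example gradients, which is where 1/m comes from. The helpers grad_one, sgd_epoch, batch_epoch and init_params, and the parameter names, are hypothetical; cross-entropy loss and separately stored biases are assumed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_one(x, y, p):
    """Cross-entropy gradients for one example (x: shape (2,), y: scalar)."""
    a1 = x.reshape(1, -1)
    a2 = sigmoid(a1 @ p["W1"] + p["b1"])
    a3 = sigmoid(a2 @ p["W2"] + p["b2"])
    delta3 = a3 - y
    delta2 = (p["W2"].T * delta3) * a2 * (1 - a2)  # sigmoid'(z1) written via a2
    return {"W1": a1.T @ delta2, "b1": delta2, "W2": a2.T @ delta3, "b2": delta3}

def mean_loss(X, Y, p):
    a3 = sigmoid(sigmoid(X @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"])
    return float(np.mean(-(Y * np.log(a3) + (1 - Y) * np.log(1 - a3))))

def init_params(seed=1):
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(size=(2, 5)), "b1": np.zeros((1, 5)),
            "W2": rng.normal(size=(5, 1)), "b2": np.zeros((1, 1))}

X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]], dtype=float)
Y = np.array([[1], [1], [0], [0]], dtype=float)
m = len(X)

def sgd_epoch(p, lr):
    # Stochastic: update after every example, no 1/m.
    for j in range(m):
        g = grad_one(X[j], Y[j, 0], p)
        for k in p:
            p[k] -= lr * g[k]

def batch_epoch(p, lr):
    # Full batch: accumulate from zero each epoch, scale by 1/m, update once.
    acc = {k: np.zeros_like(v) for k, v in p.items()}
    for j in range(m):
        g = grad_one(X[j], Y[j, 0], p)
        for k in p:
            acc[k] += g[k] / m
    for k in p:
        p[k] -= lr * acc[k]

p_sgd, p_batch = init_params(), init_params()
print("initial loss:", mean_loss(X, Y, p_sgd))
for _ in range(1000):
    sgd_epoch(p_sgd, lr=0.5)
    batch_epoch(p_batch, lr=0.5)
print("loss after SGD epochs:  ", mean_loss(X, Y, p_sgd))
print("loss after batch epochs:", mean_loss(X, Y, p_batch))
```

Note that the batch version resets its accumulator at the start of every epoch. In the question's code, w_grad1 and w_grad2 are set to 0 only once before the loops, so gradients from earlier iterations keep piling up.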
Initialize the weights using np.random.normal, not np.random.random.
Do not get rid of the bias terms.
Read up on back-prop at the link above and also here.