I have been implementing my own toy neural network library for learning purposes and testing it on basic logic gate operations like OR, AND, and XOR. While it works properly for OR, it fails for AND and XOR, only rarely giving the correct output.
I have tried a range of learning rates, and I've also plotted learning curves to see how the cost behaves with the number of epochs.
import numpy as np

class myNeuralNet:

    def __init__(self, layers = [2, 2, 1], learningRate = 0.09):
        self.layers = layers
        self.learningRate = learningRate
        self.biasses = [np.random.randn(l, 1) for l in self.layers[1:]]
        self.weights = [np.random.randn(i, o) for o, i in zip(self.layers[:-1], self.layers[1:])]
        self.cost = []

    def sigmoid(self, z):
        return (1.0 / (1.0 + np.exp(-z)))

    def sigmoidPrime(self, z):
        return (self.sigmoid(z) * (1 - self.sigmoid(z)))

    def feedForward(self, z, predict = False):
        activations = [z]
        for w, b in zip(self.weights, self.biasses):
            activations.append(self.sigmoid(np.dot(w, activations[-1]) + b))
        # for activation in activations: print(activation)
        if predict:
            return np.round(activations[-1])
        return np.array(activations)

    def drawLearningRate(self):
        import matplotlib.pyplot as plt
        plt.xlim(0, len(self.cost))
        plt.ylim(0, 5)
        plt.plot(np.array(self.cost).reshape(-1, 1))
        plt.show()

    def backPropogate(self, x, y):
        bigDW = [np.zeros(w.shape) for w in self.weights]
        bigDB = [np.zeros(b.shape) for b in self.biasses]
        activations = self.feedForward(x)
        delta = activations[-1] - y
        # print(activations[-1])
        # quit()
        self.cost.append(np.sum([- y * np.log(activations[-1]) - (1 - y) * np.log(1 - activations[-1])]))
        for l in range(2, len(self.layers) + 1):
            bigDW[-l + 1] = (1 / len(x)) * np.dot(delta, activations[-l].T)
            bigDB[-l + 1] = (1 / len(x)) * np.sum(delta, axis = 1)
            delta = np.dot(self.weights[-l + 1].T, delta) * self.sigmoidPrime(activations[-l])
        for w, dw in zip(self.weights, bigDW):
            w -= self.learningRate * dw
        for b, db in zip(self.biasses, bigDB):
            b -= self.learningRate * db.reshape(-1, 1)
        return np.sum(- y * np.log(activations[-1]) - (1 - y) * np.log(1 - activations[-1])) / 2


if __name__ == '__main__':
    nn = myNeuralNet(layers = [2, 2, 1], learningRate = 0.35)

    datasetX = np.array([[1, 1], [0, 1], [1, 0], [0, 0]]).transpose()
    datasetY = np.array([[x ^ y] for x, y in datasetX.T]).reshape(1, -1)
    print(datasetY)
    # print(nn.feedForward(datasetX, predict = True))
    for _ in range(60000):
        nn.backPropogate(datasetX, datasetY)
    # print(nn.cost)
    print(nn.feedForward(datasetX, predict = True))
    nn.drawLearningRate()
It also sometimes gives a "RuntimeWarning: overflow encountered in exp", which is sometimes followed by a failure to converge.
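(For reference, the warning comes from np.exp(-z) overflowing when -z gets very large; a common workaround, sketched below with a hypothetical stable_sigmoid helper, is to clip z before exponentiating, though this only silences the warning.)

import numpy as np

# Sketch of a numerically stable sigmoid (illustrative, not part of the class above):
# clipping z keeps np.exp(-z) from overflowing for extreme inputs.
def stable_sigmoid(z):
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))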
For cross-entropy error to work correctly, you need a probabilistic output layer on the network. Sigmoid usually doesn't work well for that and shouldn't really be used.
Your formulas seem a bit off. For the network layout you have defined, 3 layers (2, 2, 1), you have w0 (2x2) and w1 (1x2). Remember that to find dw1 you have the following:
d1 = (guess - target) * sigmoid_prime(net_inputs[1]) <- when you differentiate da2/dz1 you end up with f'(z1), not f'(a2)!
dw1 = d1 * activations[1]
db1 = np.sum(d1, axis=1)
d0 = d1 * w1 * sigmoid_prime(net_inputs[0])
dw0 = d0 * activations[0]
db0 = np.sum(d0, axis=1)
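For concreteness, here is a minimal numpy rendering of those deltas for the (2, 2, 1) layout, written with explicit matrix products and transposes so the shapes can be checked; the random data and the elementwise sigmoid/sigmoid_prime are just assumptions for the sake of this dimension check:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))

x = np.random.randn(2, 4)             # inputs, one column per sample
y = np.random.randint(0, 2, (1, 4))   # targets
w0, b0 = np.random.randn(2, 2), np.random.randn(2, 1)
w1, b1 = np.random.randn(1, 2), np.random.randn(1, 1)

z0 = w0 @ x + b0;  a1 = sigmoid(z0)   # hidden layer
z1 = w1 @ a1 + b1; a2 = sigmoid(z1)   # output layer

d1 = (a2 - y) * sigmoid_prime(z1)     # derivative taken at z1, not at a2
dw1 = d1 @ a1.T                       # (1, 4) @ (4, 2) -> (1, 2), matches w1
db1 = np.sum(d1, axis=1, keepdims=True)
d0 = (w1.T @ d1) * sigmoid_prime(z0)  # again the derivative is taken at z0
dw0 = d0 @ x.T                        # (2, 4) @ (4, 2) -> (2, 2), matches w0
db0 = np.sum(d0, axis=1, keepdims=True)
print(dw1.shape, dw0.shape)           # (1, 2) (2, 2)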
The thing to remember is that each layer has net inputs
z := w @ x + b
and activations
a := f(z).
During backpropagation, when you calculate da[i]/dz[i-1], you need to apply the derivative of the activation function to z[i-1] rather than to a[i].
z = w @ x + b
a = f(z)
da/dz = f'(z) !!!
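A quick numeric check of this point, using the same sigmoid definition as in the question, shows that evaluating the derivative at z and at a = sigmoid(z) gives different numbers, which is exactly the discrepancy in the original backPropogate:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))

z = np.array([-2.0, 0.0, 3.0])
a = sigmoid(z)

print(sigmoid_prime(z))  # correct gradient factor, roughly [0.105, 0.25, 0.045]
print(sigmoid_prime(a))  # what sigmoidPrime(activations[...]) computes: different values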
And this is for all layers. Some minor points to note:
Switch the error calculation to np.mean(.5 * (activations[-1] - y) ** 2) if you're not using soft/hardmax activation functions for the output layer (and for a single output neuron, why would you?).
Use the z-s in the derivative of the activation function during the delta calculations.
Don't use sigmoid (it is problematic in terms of vanishing gradients); try ReLU: np.where(x <= 0, 0, x) with derivative np.where(x <= 0, 0, 1), or some variant of it.
For the learning rate on XOR, anything in [.0001, .1] should be more than sufficient with any kind of optimization.
If you initialize your weight matrices as [number_of_input_units x number_of_output_units] rather than [number_of_output_units x number_of_input_units], which is what you have now, you can change z = w @ x + b to z = x @ w + b and you won't need to transpose your inputs and outputs.
Here's a sample implementation of the above:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

def cost(guess, target):
    # Mean squared error over the batch
    return np.mean(np.sum(.5 * (guess - target)**2, axis=1), axis=0)

datasetX = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
datasetY = np.array([[0.], [1.], [1.], [0.]])

# [inputs x outputs] weight layout, so z = x @ w + b needs no transposes
w0 = np.random.normal(0., 1., size=(2, 4))
w1 = np.random.normal(0., 1., size=(4, 1))
b0 = np.zeros(4)
b1 = np.zeros(1)

f1 = lambda x: np.where(x <= 0, 0, x)      # ReLU for the hidden layer
df1 = lambda d: np.where(d <= 0, 0, 1)     # its derivative
f2 = lambda x: np.where(x <= 0, .1*x, x)   # leaky ReLU for the output layer
df2 = lambda d: np.where(d <= 0, .1, 1)    # its derivative

costs = []
for i in range(250):
    # Forward pass
    a0 = datasetX
    z0 = a0 @ w0 + b0
    a1 = f1(z0)
    z1 = a1 @ w1 + b1
    a2 = f2(z1)
    costs.append(cost(a2, datasetY))

    # Backward pass: derivatives are evaluated at the z-s, not at the activations
    d1 = (a2 - datasetY) * df2(z1)
    d0 = d1 @ w1.T * df1(z0)
    dw1 = a1.T @ d1
    db1 = np.sum(d1, axis=0)
    dw0 = a0.T @ d0
    db0 = np.sum(d0, axis=0)

    # Gradient descent step
    w0 = w0 - .1 * dw0
    b0 = b0 - .1 * db0
    w1 = w1 - .1 * dw1
    b1 = b1 - .1 * db1

print(f2(f1(datasetX @ w0 + b0) @ w1 + b1))
plt.plot(costs)
plt.show()
The result it gives:
[[0.00342399]
[0.99856158]
[0.99983358]
[0.00156524]]