Tags: python-3.x, neural-network, xor, backpropagation, convergence

Why doesn't this simple neural network converge for XOR?


The code for the network below works okay, but it's too slow. This site implies that the network should get 99% accuracy after 100 epochs with a learning rate of 0.2, while my network never gets past 97% even after 1900 epochs.

Epoch 0, Inputs [0 0], Outputs [-0.83054376], Targets [0]
Epoch 100, Inputs [0 1], Outputs [ 0.72563824], Targets [1]
Epoch 200, Inputs [1 0], Outputs [ 0.87570863], Targets [1]
Epoch 300, Inputs [0 1], Outputs [ 0.90996706], Targets [1]
Epoch 400, Inputs [1 1], Outputs [ 0.00204791], Targets [0]
Epoch 500, Inputs [0 1], Outputs [ 0.93396672], Targets [1]
Epoch 600, Inputs [0 0], Outputs [ 0.00006375], Targets [0]
Epoch 700, Inputs [0 1], Outputs [ 0.94778227], Targets [1]
Epoch 800, Inputs [1 1], Outputs [-0.00149935], Targets [0]
Epoch 900, Inputs [0 0], Outputs [-0.00122716], Targets [0]
Epoch 1000, Inputs [0 0], Outputs [ 0.00457281], Targets [0]
Epoch 1100, Inputs [0 1], Outputs [ 0.95921556], Targets [1]
Epoch 1200, Inputs [0 1], Outputs [ 0.96001748], Targets [1]
Epoch 1300, Inputs [1 0], Outputs [ 0.96071742], Targets [1]
Epoch 1400, Inputs [1 1], Outputs [ 0.00110912], Targets [0]
Epoch 1500, Inputs [0 0], Outputs [-0.00012382], Targets [0]
Epoch 1600, Inputs [1 0], Outputs [ 0.9640324], Targets [1]
Epoch 1700, Inputs [1 0], Outputs [ 0.96431516], Targets [1]
Epoch 1800, Inputs [0 1], Outputs [ 0.97004973], Targets [1]
Epoch 1900, Inputs [1 0], Outputs [ 0.96616225], Targets [1]

The dataset I'm using is:

0 0 0
1 0 1
0 1 1
1 1 1

The training set is read using a function in a helper file, but that isn't relevant to the network.
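
The helper module isn't included in the question; purely as a guess from how it is used (tanh/dtanh applied to post-activation values, and a reader that splits each line into inputs and targets), a minimal placeholder that makes the script runnable could look like this. The exact implementation is an assumption, not the asker's actual code:

import numpy as np

def tanh(x):
    # element-wise tanh activation
    return np.tanh(x)

def dtanh(y):
    # derivative of tanh expressed in terms of the activation value y = tanh(x),
    # because the network stores and differentiates post-activation outputs
    return 1.0 - y ** 2

def readInput(file_name, input_size, output_size):
    # each line holds input_size input columns followed by output_size target columns
    data = np.atleast_2d(np.loadtxt(file_name))
    return (data[:, :input_size].astype(int),
            data[:, input_size:input_size + output_size].astype(int))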

import numpy as np
import helper

FILE_NAME = 'data.txt'
EPOCHS = 2000
TESTING_FREQ = 5
LEARNING_RATE = 0.2

INPUT_SIZE = 2
HIDDEN_LAYERS = [5]
OUTPUT_SIZE = 1


class Classifier:
    def __init__(self, layer_sizes):
        np.set_printoptions(suppress=True)

        self.activ = helper.tanh
        self.dactiv = helper.dtanh

        network = list()
        for i in range(1, len(layer_sizes)):
            layer = dict()
            layer['weights'] = np.random.randn(layer_sizes[i], layer_sizes[i-1])
            layer['biases'] = np.random.randn(layer_sizes[i])
            network.append(layer)

        self.network = network

    def forward_propagate(self, x):
        for i in range(0, len(self.network)):
            self.network[i]['outputs'] = self.network[i]['weights'].dot(x) + self.network[i]['biases']
            if i != len(self.network)-1:
                self.network[i]['outputs'] = x = self.activ(self.network[i]['outputs'])
            else:
                self.network[i]['outputs'] = self.activ(self.network[i]['outputs'])
        return self.network[-1]['outputs']

    def backpropagate_error(self, x, targets):
        self.forward_propagate(x)
        # output-layer delta for the L2 loss: (prediction - target) * activation derivative
        self.network[-1]['deltas'] = (self.network[-1]['outputs'] - targets) * self.dactiv(self.network[-1]['outputs'])
        for i in reversed(range(len(self.network)-1)):
            # hidden-layer delta: propagate the next layer's deltas back through its weights
            self.network[i]['deltas'] = self.network[i+1]['deltas'].dot(self.network[i+1]['weights'] * self.dactiv(self.network[i]['outputs']))

    def adjust_weights(self, inputs, learning_rate):
        self.network[0]['weights'] -= learning_rate * np.atleast_2d(self.network[0]['deltas']).T.dot(np.atleast_2d(inputs))
        self.network[0]['biases'] -= learning_rate * self.network[0]['deltas']
        for i in range(1, len(self.network)):
            self.network[i]['weights'] -= learning_rate * np.atleast_2d(self.network[i]['deltas']).T.dot(np.atleast_2d(self.network[i-1]['outputs']))
            self.network[i]['biases'] -= learning_rate * self.network[i]['deltas']

    def train(self, inputs, targets, epochs, testfreq, lrate):
        for epoch in range(epochs):
            i = np.random.randint(0, len(inputs))
            if epoch % testfreq == 0:
                predictions = self.forward_propagate(inputs[i])
                print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
            self.backpropagate_error(inputs[i], targets[i])
            self.adjust_weights(inputs[i], lrate)


inputs, outputs = helper.readInput(FILE_NAME, INPUT_SIZE, OUTPUT_SIZE)
print('Input data: {0}'.format(inputs))
print('Output targets: {0}\n'.format(outputs))
np.random.seed(1)

nn = Classifier([INPUT_SIZE] + HIDDEN_LAYERS + [OUTPUT_SIZE])

nn.train(inputs, outputs, EPOCHS, TESTING_FREQ, LEARNING_RATE)

Solution

  • The main bug is that you are doing the forward pass only 20% of the time, i.e. when epoch % testfreq == 0:

    for epoch in range(epochs):
      i = np.random.randint(0, len(inputs))
      if epoch % testfreq == 0:
        predictions = self.forward_propagate(inputs[i])
        print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
      self.backpropagate_error(inputs[i], targets[i])
      self.adjust_weights(inputs[i], lrate)
    

    When I move predictions = self.forward_propagate(inputs[i]) out of the if, I get much better results faster:

    Epoch 100, Inputs [0 1], Outputs [ 0.80317447], Targets 1
    Epoch 105, Inputs [1 1], Outputs [ 0.96340466], Targets 1
    Epoch 110, Inputs [1 1], Outputs [ 0.96057278], Targets 1
    Epoch 115, Inputs [1 0], Outputs [ 0.87960599], Targets 1
    Epoch 120, Inputs [1 1], Outputs [ 0.97725825], Targets 1
    Epoch 125, Inputs [1 0], Outputs [ 0.89433666], Targets 1
    Epoch 130, Inputs [0 0], Outputs [ 0.03539024], Targets 0
    Epoch 135, Inputs [0 1], Outputs [ 0.92888141], Targets 1
    

    Also, note that the term epoch usually means a single pass over all of your training data, which in your case is 4 samples. So, in fact, you are running four times fewer epochs than the numbers suggest (a per-epoch variant is sketched below).
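
    For reference (this variant is not in the original answer), a training loop in which one epoch really is one pass over all four samples could look like the following sketch; it reuses the existing backpropagate_error and adjust_weights methods, and the shuffling is just one common choice:

    def train_epochs(self, inputs, targets, epochs, testfreq, lrate):
        for epoch in range(epochs):
            # one epoch = one pass over every training sample, in random order
            for i in np.random.permutation(len(inputs)):
                self.backpropagate_error(inputs[i], targets[i])
                self.adjust_weights(inputs[i], lrate)
            if epoch % testfreq == 0:
                # report predictions on the whole dataset, not just one random sample
                for x, t in zip(inputs, targets):
                    print('Epoch %s, Inputs %s, Outputs %s, Targets %s'
                          % (epoch, x, self.forward_propagate(x), t))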

    Update

    I didn't pay attention to the details and, as a result, missed a few subtle yet important points:

    • the training data in the question represents OR, not XOR, so my results above are for learning the OR operation;
    • the backward pass executes the forward pass as well (so it's not a bug, rather a surprising implementation detail).

    Knowing this, I've updated the data and checked the script once again. Running the training for 10000 iterations gave an average error of ~0.001, so the model is learning, just not as fast as it could (one way to measure this is sketched below).
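
    The answer doesn't show how the average error was measured; one plausible way, assuming a mean-squared (L2-style) error over the whole dataset, is:

    def average_error(nn, inputs, targets):
        # mean squared difference between predictions and targets, averaged over all samples
        return np.mean([np.mean((nn.forward_propagate(x) - t) ** 2)
                        for x, t in zip(inputs, targets)])

    # e.g. after training:
    # print('Average error: %.6f' % average_error(nn, inputs, outputs))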

    A simple neural network (without a built-in normalization mechanism) is quite sensitive to particular hyperparameters, such as weight initialization and the learning rate. I tried various values manually, and here's what I got:

    # slightly bigger learning rate
    LEARNING_RATE = 0.3
    ...
    # slightly bigger init variation of weights
    layer['weights'] = np.random.randn(layer_sizes[i], layer_sizes[i-1]) * 2.0
    

    This gives the following performance:

    ...
    Epoch 960, Inputs [1 1], Outputs [ 0.01392014], Targets 0
    Epoch 970, Inputs [0 0], Outputs [ 0.04342895], Targets 0
    Epoch 980, Inputs [1 0], Outputs [ 0.96471654], Targets 1
    Epoch 990, Inputs [1 1], Outputs [ 0.00084511], Targets 0
    Epoch 1000, Inputs [0 0], Outputs [ 0.01585915], Targets 0
    Epoch 1010, Inputs [1 1], Outputs [-0.004097], Targets 0
    Epoch 1020, Inputs [1 1], Outputs [ 0.01898956], Targets 0
    Epoch 1030, Inputs [0 0], Outputs [ 0.01254217], Targets 0
    Epoch 1040, Inputs [1 1], Outputs [ 0.01429213], Targets 0
    Epoch 1050, Inputs [0 1], Outputs [ 0.98293925], Targets 1
    ...
    Epoch 1920, Inputs [1 1], Outputs [-0.00043072], Targets 0
    Epoch 1930, Inputs [0 1], Outputs [ 0.98544288], Targets 1
    Epoch 1940, Inputs [1 0], Outputs [ 0.97682002], Targets 1
    Epoch 1950, Inputs [1 0], Outputs [ 0.97684186], Targets 1
    Epoch 1960, Inputs [0 0], Outputs [-0.00141565], Targets 0
    Epoch 1970, Inputs [0 0], Outputs [-0.00097559], Targets 0
    Epoch 1980, Inputs [0 1], Outputs [ 0.98548381], Targets 1
    Epoch 1990, Inputs [1 0], Outputs [ 0.97721286], Targets 1
    

    The average accuracy is close to 98.5% after 1000 iterations and 99.1% after 2000 iterations. That's a bit slower than promised, but good enough. I'm sure it can be tuned further, but that isn't the goal of this toy exercise. After all, tanh is not the best activation function, and classification problems are better solved with a cross-entropy loss (rather than an L2 loss). So I wouldn't worry too much about the performance of this particular network and would move on to logistic regression, which will definitely learn faster (a sketch of the output-layer changes appears below).
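
    As an illustration of the cross-entropy suggestion (this is not part of the original answer), switching the output layer to a sigmoid with a binary cross-entropy loss makes the output delta collapse to prediction minus target, so there is no small derivative factor slowing learning near 0 and 1. A minimal sketch of the two methods that would change, assuming the rest of the Classifier above stays the same (the subclass name is purely hypothetical):

    class CrossEntropyClassifier(Classifier):
        # same network, but with a sigmoid output unit and binary cross-entropy loss

        def forward_propagate(self, x):
            for i, layer in enumerate(self.network):
                z = layer['weights'].dot(x) + layer['biases']
                if i != len(self.network) - 1:
                    layer['outputs'] = x = self.activ(z)          # hidden layers keep tanh
                else:
                    layer['outputs'] = 1.0 / (1.0 + np.exp(-z))   # sigmoid output in [0, 1]
            return self.network[-1]['outputs']

        def backpropagate_error(self, x, targets):
            self.forward_propagate(x)
            # for sigmoid + binary cross-entropy, the output delta is simply the residual
            self.network[-1]['deltas'] = self.network[-1]['outputs'] - targets
            for i in reversed(range(len(self.network) - 1)):
                self.network[i]['deltas'] = (self.network[i+1]['deltas'].dot(self.network[i+1]['weights'])
                                             * self.dactiv(self.network[i]['outputs']))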