python, tensorflow, machine-learning, keras, neural-network

Tensorflow for XOR is not predicting correctly after 500 epochs


I'm trying to implement a neural network to solve the XOR problem using TensorFlow. I chose sigmoid as the activation function, a (2, 2, 1) shape and optimizer=SGD(). I chose batch_size=1 because the whole problem universe is only 4 examples, so it is really small. The problem is that the predictions are not even close to the right answers. What am I doing wrong?

I'm doing this on Google Colab, and the Tensorflow version is 2.3.0.

import tensorflow as tf
import numpy as np



x = np.array([[0, 0],
              [1, 1],
              [1, 0],
              [0, 1]],  dtype=np.float32)

y = np.array([[0], 
              [0], 
              [1], 
              [1]],     dtype=np.float32)



model =  tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(2,)))
model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid))
model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid))
model.add(tf.keras.layers.Dense(1, activation=tf.keras.activations.sigmoid))

model.compile(optimizer=tf.keras.optimizers.SGD(), 
              loss=tf.keras.losses.MeanSquaredError(), 
              metrics=['binary_accuracy'])

history = model.fit(x, y, batch_size=1, epochs=500, verbose=False)

print("Tensorflow version: ", tf.__version__)
predictions = model.predict_on_batch(x)
print(predictions)

The output:

Tensorflow version:  2.3.0
WARNING:tensorflow:10 out of the last 10 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7f69f7a83a60> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
[[0.5090364 ]
 [0.4890102 ]
 [0.50011414]
 [0.49678832]]

Solution

  • The problem is your learning rate and the way you are optimizing your weights

    Another factor to keep in mind during training is the step size that we take in the direction of the gradient. If this step is too large, we can end up in the wrong place, jumping outside of our local minimum; if it is too small, we may never reach the minimum.
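
    As a plain illustration of that step (a generic example, not code from this answer), each SGD update moves every weight by the gradient scaled by the learning rate, so the learning rate directly controls how far each step goes:

    # Generic illustration of a single SGD step on one weight
    w, gradient, learning_rate = 0.5, 2.0, 0.01
    w = w - learning_rate * gradient   # too small a rate -> tiny steps, too large -> overshooting
    print(w)                           # 0.48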

    By default, Stochastic Gradient Descent (SGD) in Keras has a learning rate of 0.01, and this learning rate stays fixed during training. If you inspect your training history, the loss moves very slowly toward the global minimum, or sometimes jumps to higher values. For your specific problem, it's quite difficult to reach the minimum with a fixed learning rate, because it takes no account of the loss-function landscape.

    For example, using Adam as the optimizer and learning_rate=0.02, I was able to reach an accuracy of 1:

    import tensorflow as tf
    import numpy as np
    
    x = np.array([[0, 0],
                  [1, 1],
                  [1, 0],
                  [0, 1]],  dtype=np.float32)
    
    y = np.array([[0], 
                  [0], 
                  [1], 
                  [1]],     dtype=np.float32)
    
    model =  tf.keras.models.Sequential()
    model.add(tf.keras.Input(shape=(2,)))
    model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid))
    model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid))
    model.add(tf.keras.layers.Dense(1, activation=tf.keras.activations.sigmoid))
    
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.02), # learning rate was 0.001 prior to this change
                  loss=tf.keras.losses.MeanSquaredError(), 
                  metrics=['mse', 'binary_accuracy'])
    model.summary()
    
    history = model.fit(x, y, batch_size=1, epochs=500)
    
    print("Tensorflow version: ", tf.__version__)
    predictions = model.predict_on_batch(x)
    print(predictions)
    
    The output:
    
    [[0.05162644]
     [0.06670767]
     [0.9240402 ]
     [0.923379  ]]
    

    I used Adam because it has an adaptive learning rate which is tuned during training, depending on how the training is going.

    If you use a larger learning rate (0.1) while still using SGD, you can see in the training history that at one point the accuracy reaches 1, but right after that it drops back to lower values. That's because the learning rate is fixed. Another strategy would be to stop the training as soon as you reach that value with SGD, for example with a Keras callback, as sketched below.
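
    As a rough sketch of that idea (the callback class and the threshold are my own illustration, not part of the original answer), a custom Keras callback can set stop_training once the training accuracy reaches 1:

    # Hypothetical helper: stop as soon as training accuracy hits 1.0
    class StopAtFullAccuracy(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # logs holds the metrics computed at the end of each epoch
            if logs and logs.get('binary_accuracy', 0.0) >= 1.0:
                self.model.stop_training = True
    
    history = model.fit(x, y, batch_size=1, epochs=500,
                        callbacks=[StopAtFullAccuracy()])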

    Don't forget to tune your learning rate and to choose the right optimizer; it's fundamental for fast training and for reaching a good minimum.
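
    If you want to keep SGD but avoid a single fixed step, one option (just a sketch; the decay values below are arbitrary rather than tuned) is to pass a learning-rate schedule to the optimizer instead of a constant:

    # Sketch: start with a larger step and let it decay as training progresses
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.1,   # illustrative starting value, not tuned
        decay_steps=500,             # optimizer steps between decays
        decay_rate=0.9)
    
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr_schedule),
                  loss=tf.keras.losses.MeanSquaredError(),
                  metrics=['binary_accuracy'])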

    Also consider changing the network architecture (adding nodes, and using other activation functions, like ReLU, for the hidden layers); a possible variant is sketched below.
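
    For instance, a variant of the same model with a ReLU hidden layer (only a sketch of that suggestion, not a tuned architecture) could look like this:

    # Sketch: a few more hidden nodes with ReLU, sigmoid only on the output
    model = tf.keras.models.Sequential()
    model.add(tf.keras.Input(shape=(2,)))
    model.add(tf.keras.layers.Dense(4, activation='relu'))     # wider hidden layer
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))  # binary output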

    Here are some useful details on how to handle the learning rate: [image from the original answer]