
MxNet with R: Simple XOR Neural Network is not learning


I wanted to experiment with the MxNet library and built a simple neural network that learns the XOR function. The problem I am facing is that the model is not learning.

Here is the complete script:

library(mxnet)

train = matrix(c(0,0,0,
                 0,1,1,
                 1,0,1,
                 1,1,0),
               nrow=4,
               ncol=3,
               byrow=TRUE)

train.x = train[,-3]
train.y = train[,3]

data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=2)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=3)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=1)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

mx.set.seed(0)
model <- mx.model.FeedForward.create(
  softmax,
  X = t(train.x),
  y = train.y,
  num.round = 10,
  array.layout = "columnmajor",
  learning.rate = 0.01,
  momentum = 0.4,
  eval.metric = mx.metric.accuracy,
  epoch.end.callback = mx.callback.log.train.metric(100))

predict(model,train.x,array.layout="rowmajor")

And this output is produced:

Start training with 1 devices
[1] Train-accuracy=NaN
[2] Train-accuracy=0.5
[3] Train-accuracy=0.5
[4] Train-accuracy=0.5
[5] Train-accuracy=0.5
[6] Train-accuracy=0.5
[7] Train-accuracy=0.5
[8] Train-accuracy=0.5
[9] Train-accuracy=0.5
[10] Train-accuracy=0.5

> predict(model,train.x,array.layout="rowmajor")
[,1] [,2] [,3] [,4]
[1,]    1    1    1    1

How should I use mxnet to get this example working?

Regards, vaka


Solution

  • Usually an activation layer doesn't go directly after the input; it should be applied once the first layer's computation is done. You can still imitate the XOR function with your old code, but it needs a few tweaks:

    1. You are right that you need to initialize the weights. Which initial weights are best is a subject of much discussion in the Deep Learning community, but in my experience Xavier weights work well.

    2. If you want to use softmax, you need to change the number of units in the last layer to 2, because you have 2 classes: 0 and 1.

    After making these 2 changes, plus a few minor optimizations such as removing the transposition of the matrix, we get the following code:

    library(mxnet)
    
    train = matrix(c(0,0,0,
                     0,1,1,
                     1,0,1,
                     1,1,0),
                   nrow=4,
                   ncol=3,
                   byrow=TRUE)
    
    train.x = train[,-3]
    train.y = train[,3]
    
    # Network: input -> fc1(2) -> relu -> fc2(3) -> relu -> fc3(2) -> softmax
    data <- mx.symbol.Variable("data")
    fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=2)
    act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
    fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=3)
    act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
    fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=2)  # 2 units: one per class
    softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")
    
    mx.set.seed(0)
    model <- mx.model.FeedForward.create(
      softmax,
      X = train.x,  # no transpose needed with rowmajor layout
      y = train.y,
      num.round = 50,
      array.layout = "rowmajor",
      learning.rate = 0.1,
      momentum = 0.99,
      eval.metric = mx.metric.accuracy,
      # Xavier initialization instead of the default weights
      initializer = mx.init.Xavier(rnd_type = "uniform", factor_type = "avg", magnitude = 3),
      epoch.end.callback = mx.callback.log.train.metric(100))
    
    predict(model,train.x,array.layout="rowmajor")
    

    We get the following results:

    Start training with 1 devices
    [1] Train-accuracy=NaN
    [2] Train-accuracy=0.75
    [3] Train-accuracy=0.5
    [4] Train-accuracy=0.5
    [5] Train-accuracy=0.5
    [6] Train-accuracy=0.5
    [7] Train-accuracy=0.5
    [8] Train-accuracy=0.5
    [9] Train-accuracy=0.5
    [10] Train-accuracy=0.75
    [11] Train-accuracy=0.75
    [12] Train-accuracy=0.75
    [13] Train-accuracy=0.75
    [14] Train-accuracy=0.75
    [15] Train-accuracy=0.75
    [16] Train-accuracy=0.75
    [17] Train-accuracy=0.75
    [18] Train-accuracy=0.75
    [19] Train-accuracy=0.75
    [20] Train-accuracy=0.75
    [21] Train-accuracy=0.75
    [22] Train-accuracy=0.5
    [23] Train-accuracy=0.5
    [24] Train-accuracy=0.5
    [25] Train-accuracy=0.75
    [26] Train-accuracy=0.75
    [27] Train-accuracy=0.75
    [28] Train-accuracy=0.75
    [29] Train-accuracy=0.75
    [30] Train-accuracy=0.75
    [31] Train-accuracy=0.75
    [32] Train-accuracy=0.75
    [33] Train-accuracy=0.75
    [34] Train-accuracy=0.75
    [35] Train-accuracy=0.75
    [36] Train-accuracy=0.75
    [37] Train-accuracy=0.75
    [38] Train-accuracy=0.75
    [39] Train-accuracy=1
    [40] Train-accuracy=1
    [41] Train-accuracy=1
    [42] Train-accuracy=1
    [43] Train-accuracy=1
    [44] Train-accuracy=1
    [45] Train-accuracy=1
    [46] Train-accuracy=1
    [47] Train-accuracy=1
    [48] Train-accuracy=1
    [49] Train-accuracy=1
    [50] Train-accuracy=1
    > 
    > predict(model,train.x,array.layout="rowmajor")
              [,1]         [,2]         [,3]         [,4]
    [1,] 0.9107883 2.618128e-06 6.384078e-07 0.9998743534
    [2,] 0.0892117 9.999974e-01 9.999994e-01 0.0001256234
    

    The output of softmax is interpreted as the probability of belonging to each class; it is not a hard "0" or "1" value like the one you would get from regular arithmetic. The answer means the following (a short sketch for turning these probabilities into labels follows the list):

    • In case "0 and 0": probability of class "0" = 0.9107883 and of class "1" = 0.0892117, meaning the prediction is 0
    • In case "0 and 1": probability of class "0" = 2.618128e-06 and of class "1" = 9.999974e-01, meaning the prediction is 1
    • In case "1 and 0": probability of class "0" = 6.384078e-07 and of class "1" = 9.999994e-01, meaning the prediction is 1
    • In case "1 and 1": probability of class "0" = 0.9998743534 and of class "1" = 0.0001256234, meaning the prediction is 0
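
    If you want hard class labels rather than probabilities, pick the class with the highest probability for each sample. Here is a minimal sketch (not part of the original answer), assuming `model` and `train.x` are as defined above and `predict` returns the 2 x 4 probability matrix shown:

    # Hypothetical post-processing step: convert softmax probabilities
    # into hard 0/1 labels. Row 1 of `preds` holds the probability of
    # class "0", row 2 the probability of class "1".
    preds <- predict(model, train.x, array.layout = "rowmajor")

    # For each sample (column), take the row index of the largest
    # probability; subtract 1 because R is 1-indexed while the classes
    # are 0 and 1.
    labels <- max.col(t(preds)) - 1
    labels
    # Expected for the XOR inputs above: 0 1 1 0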