Tags: r, neural-network, predict, mxnet

can't predict in mxnet 0.94 for R


I have been able to use nnet and neuralnet to predict values in a conventional backprop network, but have been struggling to do the same with MXNet in R for several reasons.

This is the file (simple CSV with headers, columns have been normalized) https://files.fm/u/cfhf3zka
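For comparison, the kind of neuralnet baseline that works for me looks roughly like this (a sketch only: the column names in1, in2, in3, out are placeholders for the actual CSV headers, and the hidden-layer sizes are illustrative):

library(neuralnet)

filedata <- read.csv("example.csv")

# Placeholder column names; substitute the real headers from example.csv.
nn <- neuralnet(out ~ in1 + in2 + in3,
                data = filedata,
                hidden = c(3, 3),      # two small hidden layers, as an example
                linear.output = TRUE)  # linear output, since this is regression
nnpreds <- compute(nn, filedata[, 1:3])$net.result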

And this is the code I use:

require(mxnet)

filedata <- read.csv("example.csv")

datain  <- filedata[, 1:3]   # first three columns: normalized inputs
dataout <- filedata[, 4]     # fourth column: normalized target

lcinm  <- data.matrix(datain,  rownames.force = NA)   # logical NA, not the string "NA"
lcoutm <- data.matrix(dataout, rownames.force = NA)
lcouta <- as.numeric(lcoutm)

data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=3)
act1 <- mx.symbol.Activation(fc1, name="sigm1", act_type="sigmoid")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=3)
act2 <- mx.symbol.Activation(fc2, name="sigm2", act_type="sigmoid")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=3)
act3 <- mx.symbol.Activation(fc3, name="sigm3", act_type="sigmoid")
fc4 <- mx.symbol.FullyConnected(act3, name="fc4", num_hidden=1)
softmax <- mx.symbol.LogisticRegressionOutput(fc4, name="softmax")

mx.set.seed(0)
mxn <- mx.model.FeedForward.create(softmax,
                                   X = lcinm,
                                   y = lcouta,
                                   array.layout = "rowmajor",
                                   learning.rate = 0.01,
                                   eval.metric = mx.metric.rmse)

preds <- predict(mxn, lcinm)

predsa <- array(preds)   # flatten the 1 x N prediction matrix to a plain vector

predsa

The console output is:

Start training with 1 devices
[1] Train-rmse=0.0852988247858687
[2] Train-rmse=0.068769514264606
[3] Train-rmse=0.0687647380075881
[4] Train-rmse=0.0687647164103567
[5] Train-rmse=0.0687647161066822
[6] Train-rmse=0.0687647160828069
[7] Train-rmse=0.0687647161241598
[8] Train-rmse=0.0687647160882147
[9] Train-rmse=0.0687647160594508
[10] Train-rmse=0.068764716079949
> preds <- predict(mxn, lcinm)
Warning message:
In mx.model.select.layout.predict(X, model) :
  Auto detect layout of input matrix, use rowmajor..

> predsa <-array(preds)
> predsa
   [1] 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764
  [10] 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764 0.6776764

So the network converges to an "average" prediction but cannot predict individual values. I have tried other configurations and learning rates to avoid this, but have never obtained an output that actually varies across inputs.
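A quick sanity check (reusing lcouta and predsa from above) makes the collapse visible: a model that always predicts the target mean has an RMSE equal to the target's standard deviation, which appears to be exactly where Train-rmse plateaus:

# Sanity check: a constant prediction at the target mean yields
# an RMSE equal to the target's standard deviation.
sd(lcouta)    # compare with the plateaued Train-rmse (~0.0688)
sd(predsa)    # near zero if the output really is constant
mean(lcouta)  # presumably near the constant 0.6776764 above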


Solution

  • I tried your example, and it seems like you're trying to predict a continuous output with LogisticRegressionOutput. I believe you should use LinearRegressionOutput instead. You can see examples of this here and a Julia example here. Also, since you're predicting continuous output, it might be better to use a different activation function such as ReLU; see some reasons for this at this question.

    With these changes, I produced the following code:

    data <- mx.symbol.Variable("data")
    fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=3)
    act1 <- mx.symbol.Activation(fc1, name="sigm1", act_type="softrelu")
    fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=3)
    act2 <- mx.symbol.Activation(fc2, name="sigm2", act_type="softrelu")
    fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=3)
    act3 <- mx.symbol.Activation(fc3, name="sigm3", act_type="softrelu")
    fc4 <- mx.symbol.FullyConnected(act3, name="fc4", num_hidden=1)
    softmax <- mx.symbol.LinearRegressionOutput(fc4, name="softmax")
    
    mx.set.seed(0)
    mxn <- mx.model.FeedForward.create(array.layout = "rowmajor",
                                       softmax,
                                       X = lcinm,
                                       y = lcouta,
                                       learning.rate=1,
                                       eval.metric=mx.metric.rmse,
                                       num.round = 100)
    
    preds <- predict(mxn, lcinm)
    
    predsa <- array(preds)
    require(ggplot2)
    qplot(x = dataout, y = predsa, geom = "point", alpha = I(0.6)) +  # I() keeps alpha fixed rather than mapped
      geom_abline(slope = 1)
    

    This gives me a steadily decreasing training error:

    Start training with 1 devices
    [1] Train-rmse=0.0725415842873665
    [2] Train-rmse=0.0692660343340093
    [3] Train-rmse=0.0692562284995407
    ...
    [97] Train-rmse=0.048629236911287
    [98] Train-rmse=0.0486272021266279
    [99] Train-rmse=0.0486251858007309
    [100] Train-rmse=0.0486231872849457
    

    And the predicted outputs start to align with the actual outputs, as the plot of predicted vs. actual values (with the y = x reference line) demonstrates.
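
    To quantify that alignment beyond the plot, the fit can be checked directly against the targets (a small follow-up sketch reusing predsa and lcouta from above):

    # In-sample fit of the retrained model.
    sqrt(mean((predsa - lcouta)^2))  # RMSE; should sit near the final Train-rmse
    cor(predsa, lcouta)              # correlation between predictions and targets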