
R neuralnet taking huge number of steps for linear function f(x)=19x


Here is the scenario:

I ran the following code, which creates a neural network (using the neuralnet package in R) to approximate the function f(x)=x^2:

rm(list=ls());
library(neuralnet);
set.seed(2016);
# Prepare Training Data
attribute<-as.data.frame(sample(seq(-2,2,length=50), 50 , replace=FALSE ),ncol=1);
response<-attribute^2;
data <- cbind(attribute,response);
colnames(data)<- c("attribute","response");

# Create DNN
fit<-neuralnet(response~attribute,data=data,hidden = c(3,3),threshold=0.01);
fit$result.matrix;

This worked fine and converged in 3191 steps. Then I made a small change to the code: instead of a quadratic function, I approximated a very simple linear function f(x)=2x. That worked fine too, so I tweaked the coefficient of x and conducted multiple runs, e.g.

f(x) = 2x
f(x) = 3x
...
f(x) = 19x

Up to this point it worked fine. But I noticed that the number of steps required to converge increased dramatically from 2x to 19x. For f(x)=19x, for example, it took an astonishing 84099 steps. It is strange that the network needs so many steps to converge for a linear function, when the quadratic f(x)=x^2 took only 3191 steps.
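
To see the trend directly, here is a minimal sketch that loops over the coefficients (my own illustration, not part of the original runs; exact step counts vary with the seed, and stepmax is raised so the slower runs can still finish):

library(neuralnet);
set.seed(2016);
x <- sample(seq(-2,2,length=50), 50, replace=FALSE);

# Fit f(x)=k*x for k = 2..19 and record the steps each run needs to converge
steps <- sapply(2:19, function(k) {
  d <- data.frame(attribute = x, response = k*x);
  fit <- neuralnet(response~attribute, data=d, hidden=c(3,3),
                   threshold=0.01, stepmax=2e5);
  fit$result.matrix["steps",1];
});
names(steps) <- paste0(2:19, "x");
steps;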

So when I changed the function to f(x)=20x, it presumably needed even more steps, and I got the following warning:

> set.seed(2016);
> rm(list=ls());
> # Prepare Training Data
> attribute<-as.data.frame(sample(seq(-2,2,length=50), 50 , replace=FALSE ),ncol=1);
> response<-attribute*20;
> data <- cbind(attribute,response);
> colnames(data)<- c("attribute","response");
> 
> # Create DNN
> fit<-neuralnet(response~attribute,data=data,hidden = c(3,3),threshold=0.01);
Warning message:
algorithm did not converge in 1 of 1 repetition(s) within the stepmax 
> fit$result.matrix;

So I guess I can increase the default stepmax parameter to allow more steps. But the real question is: why does it need so many steps for a simple linear function like this?
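
(For reference, the step limit is the stepmax argument, which defaults to 1e+05; raising it lets the 20x run finish, though it sidesteps rather than answers the question:)

# Allow up to 1e6 steps instead of the default 1e5
fit <- neuralnet(response~attribute, data=data, hidden=c(3,3),
                 threshold=0.01, stepmax=1e6);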


Solution

  • The biggest thing here, I believe, is that you need to scale your data. As your values get larger, the network has to cover a larger range of outputs (this can be mitigated with more sophisticated techniques such as different forms of momentum, but that isn't the point here). Min-max scaling maps each column to [0, 1] via (x - min) / (max - min). You can do it like so:

    # Min-max scale every column to the range [0, 1]
    maxs <- apply(data, 2, max)
    mins <- apply(data, 2, min)
    
    raw.sc <- scale(data, center = mins, scale = maxs - mins)
    scaled <- as.data.frame(raw.sc)
    

    You can now train a NN much more quickly. Note: you shouldn't need multiple layers for a linear function; here I demonstrate this with a single-layer, single-node network. This isn't really a 'deep learning' problem.

    set.seed(123)
    # Create NN
    fit<-neuralnet(response~attribute,data=scaled,hidden = c(1),threshold=0.01);
    
    
    # Use the normalized input and de-normalize the output
    # (index [2] selects the response column's center/scale values)
    pr.nn <- compute(fit, scaled$attribute)
    pr.nn_ <- pr.nn$net.result*attr(raw.sc, 'scaled:scale')[2] + attr(raw.sc, 'scaled:center')[2]
    

    In this case, it converges in 1349 steps. You can then compute a metric such as the mean squared error (MSE):

    # MSE
    sum((data$response - pr.nn_)^2)/nrow(data)
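
    As a quick sanity check (my own sketch; new.x is a hypothetical fresh input), new inputs must be min-max scaled with the training attributes stored on raw.sc before prediction, and the output de-normalized the same way:

    # Hypothetical new inputs, scaled with the attribute column's center/scale ([1])
    new.x <- c(-1, 0, 1.5)
    new.sc <- (new.x - attr(raw.sc, 'scaled:center')[1]) / attr(raw.sc, 'scaled:scale')[1]
    pred <- compute(fit, data.frame(attribute = new.sc))$net.result
    # De-normalize with the response column ([2]); results should be close to 20*new.x
    pred * attr(raw.sc, 'scaled:scale')[2] + attr(raw.sc, 'scaled:center')[2]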