Tags: neural-network, julia, relu, flux-machine-learning

Fitting a neural network with ReLUs to polynomial functions


Out of curiosity I am trying to fit a neural network with rectified linear units (ReLUs) to polynomial functions. For example, I would like to see how easy (or difficult) it is for a neural network to come up with an approximation for the function f(x) = x^2 + x. The following code should be able to do it, but it does not seem to learn anything. When I run

using Base.Iterators: repeated
ENV["JULIA_CUDA_SILENT"] = true
using Flux
using Flux: throttle
using Random

f(x) = x^2 + x
x_train = shuffle(1:1000)
y_train = f.(x_train)
x_train = hcat(x_train...)

m = Chain(
    Dense(1, 45, relu),
    Dense(45, 45, relu),
    Dense(45, 1),
    softmax
)

function loss(x, y) 
    Flux.mse(m(x), y)
end

evalcb = () -> @show(loss(x_train, y_train))
opt = ADAM()

@show loss(x_train, y_train)

dataset = repeated((x_train, y_train), 50)

Flux.train!(loss, params(m), dataset, opt, cb = throttle(evalcb, 10))

println("Training finished")

@show m([20])

it returns

loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
Training finished
m([20]) = Float32[1.0]

Does anyone here see how I could make the network fit f(x) = x^2 + x?


Solution

  • There seem to be a couple of things wrong with your attempt, and they mostly have to do with how you use your optimizer and treat your input -- nothing wrong with Julia or Flux. The provided solution does learn, but is by no means optimal.

    • It makes no sense to have a softmax output activation in a regression problem. Softmax is used in classification problems, where the outputs of your model represent probabilities and therefore have to lie in the interval (0,1). Your polynomial clearly takes values outside this interval. Regression problems like this one usually use a linear output activation, which in Flux means defining no activation on the output layer.
    • The shape of your data matters. train! computes gradients for loss(d...), where d is a batch in your data. In your case a minibatch consists of 1000 samples, and this same batch is repeated 50 times. Neural nets are often trained with smaller batch sizes but a larger sample set. In the code I provided, every batch consists of different data (see the sketch of what train! does per batch after the code below).
    • In general, it is advisable to normalize your input when training neural nets. Your input takes values from 1 to 1000. My example applies a simple linear transformation to get the input data into the right range.
    • Normalization can also be applied to the output. If the targets are large, they can lead to (too) large gradients and weight updates. Another approach is to lower the learning rate a lot (a sketch of this alternative is given after the code below).
    using Flux
    using Flux: @epochs
    using Random
    
    # Scale values into a smaller range by simply dividing by 1000.
    normalize(x) = x/1000
    
    # Generate a batch of n samples of f on the interval (0, 1000),
    # returned as 1×n matrices (one column per sample) in normalized units.
    function generate_data(n)
        f(x) = x^2 + x
        xs = reduce(hcat, rand(n)*1000)
        ys = f.(xs)
        (normalize(xs), normalize(ys))
    end
    batch_size = 32
    num_batches = 10000
    # A generator, so that every batch consists of freshly generated data.
    data_train = (generate_data(batch_size) for _ in 1:num_batches)
    data_test = generate_data(100)
    
    # Linear output: no activation on the last Dense layer.
    model = Chain(Dense(1, 40, relu), Dense(40, 40, relu), Dense(40, 1))
    loss(x, y) = Flux.mse(model(x), y)
    
    opt = ADAM()
    ps = Flux.params(model)
    Flux.train!(loss, ps, data_train, opt, cb = () -> @show loss(data_test...))
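    
    As mentioned in the second bullet point, train! simply computes gradients of loss(d...) for every element d of the data iterator and applies an optimizer step. Roughly, it is equivalent to the following loop (a sketch using Flux's implicit-params API, matching the params/ADAM style used above):
    
    for d in data_train
        # d is one (xs, ys) batch; compute gradients of the loss w.r.t. the model parameters
        gs = Flux.gradient(() -> loss(d...), ps)
        # apply one ADAM update to the parameters
        Flux.Optimise.update!(opt, ps, gs)
    end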
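    
    Regarding the last bullet point: instead of normalizing the targets, you could keep the raw ys and shrink the optimizer's step size. This is only an illustrative alternative, not part of the solution above, and the 1e-5 value is just a guess; ADAM's first argument is the learning rate:
    
    # Alternative sketch: keep the targets unscaled and lower the learning rate a lot
    opt = ADAM(1e-5)   # default learning rate is 1e-3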
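    
    Finally, to come back to the original m([20]) check: the model above works in normalized units, so a query has to be scaled down on the way in and scaled back up on the way out. A minimal sanity check, reusing the definitions from the code above (the exact number will differ from run to run):
    
    x = 20
    y_pred = model([normalize(x)])[1] * 1000   # undo the /1000 output scaling
    @show y_pred   # should land in the neighbourhood of f(20) = 420 once the loss is low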