I'm a complete newbie to ANNs. After reading through articles online, I have implemented a FF neural network in C++. Among the parameters of the constructor, these are the important ones:
I want an output of two decimal numbers which usually range anywhere from -xx.xx to +xx.xx but can sometimes go up into the hundreds.
I figured I would use one hidden layer and use the sigmoid function for it. The output layer would be a linear function so the value won't be clamped to [-1, 1].
I've searched through many beginner resources online regarding activation functions but most just mention the log-sigmoid / hyperbolic tangent and other non-linear functions. I'm just utterly confused about the usage of a linear function.
My questions are:
Should I just use f(x) = x as my linear function? Is this commonly used? Or should I experiment more with coefficients with functions like f(x) = ax + b?
When I do backpropagation, everything I've read so far mentions taking the derivative of your activation function to calculate the deltas. If a linear function is used, how does this work? If I use f(x) = x, the derivative should be 1. My implementation thus uses 1 to calculate the errors and adjust the weights, just as it would with a non-linear activation function. Am I headed in totally the wrong direction? I'm utterly confused because none of the resources I've read mention this at all.
Thanks!
I think it's useful here to make a distinction between the activation function used for hidden layers, and the activation function used for the output layer. In many models, these activations are not the same, and though the backprop algorithm doesn't care about that, I think it's conceptually quite important.
A canonical neural network architecture consists of an input "layer," one or more hidden layers, and an output layer. (I put the input layer in scare quotes because this layer typically does not have any associated parameters; it's just a way of incorporating the input into the model.) Given an input vector $x$, information flows forward through the model, activating each hidden layer in turn, and finally activating the output layer.
Let's consider a network with one input "layer," one hidden layer, and one output layer. The information flow in this model is:
$$x \;\longrightarrow\; h(x) = s(Wx + b) \;\longrightarrow\; y(x) = r(Vh(x) + c)$$
Here, I've represented the output of the hidden layer as $h(x)$ and the output of the output layer as $y(x)$. Each layer is constructed as a weighted sum of its input, combined with some offset or bias (this is an affine transformation: $Wx + b$ for the hidden layer, and $Vh + c$ for the output layer). Additionally, each layer's affine input transformation is further transformed by a possibly nonlinear "activation function": $s(\cdot)$ for the hidden layer, and $r(\cdot)$ for the output layer.
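As a purely illustrative sketch of that forward flow, here is a minimal C++ version for a single input vector, assuming plain `std::vector` storage, a sigmoid hidden activation $s$, and a linear output $r$; the function names and layout are my own, not taken from your implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid, used here as the hidden-layer activation s(.).
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Forward pass of a one-hidden-layer network:
//   h = s(W x + b),  y = V h + c   (linear output activation r(z) = z).
std::vector<double> forward(const std::vector<std::vector<double>>& W,
                            const std::vector<double>& b,
                            const std::vector<std::vector<double>>& V,
                            const std::vector<double>& c,
                            const std::vector<double>& x) {
    std::vector<double> h(W.size());
    for (std::size_t j = 0; j < W.size(); ++j) {
        double z = b[j];
        for (std::size_t i = 0; i < x.size(); ++i) z += W[j][i] * x[i];
        h[j] = sigmoid(z);          // nonlinear hidden activation
    }
    std::vector<double> y(V.size());
    for (std::size_t k = 0; k < V.size(); ++k) {
        double z = c[k];
        for (std::size_t j = 0; j < h.size(); ++j) z += V[k][j] * h[j];
        y[k] = z;                   // linear output: r(z) = z
    }
    return y;
}
```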
Let's suppose this network is being used for binary classification. It's extremely common these days to use a logistic function for both $s$ and $r$: $s(z) = r(z) = (1 + e^{-z})^{-1}$, but they are used this way for different reasons:
For the hidden layer, using a logistic function causes the model's internal representation $h(x)$ to be a nonlinear function of $x$. This gives the model more representational power than using a linear activation $s(z) = z$.
For the output layer, the logistic function ensures that the output $y(x)$ of the model can be treated as the probability of a Bernoulli random variable.
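(As a small illustrative aside of mine, not part of the original argument: with a logistic output, the activation can be read directly as $P(\text{class} = 1 \mid x)$, or thresholded at 0.5 for a hard class decision.)

```cpp
#include <cmath>

// With a logistic output activation, y(x) in (0, 1) can be read as
// P(class = 1 | x); thresholding at 0.5 gives a hard decision.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }
bool predict_class(double output_pre_activation) {
    return sigmoid(output_pre_activation) >= 0.5;
}
```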
Now let's suppose that you're using a network like this for regression. It's quite common for regression problems to need to model outputs outside the open interval $(0, 1)$. In these cases, it's extremely common to use the logistic function as the activation of the hidden layer, $s(z) = (1 + e^{-z})^{-1}$, but to activate the output layer linearly, $r(z) = z$, so that $y(x) = Vh(x) + c$. The reason for using these activation functions is:
For the hidden layer, using a nonlinear activation gives the model more representational power -- just like the classifier model above.
For the output layer, a linear activation ensures that the model can achieve any range of output values. Essentially the output of the model is a basic affine transformation (a scaling, rotation, skew, and/or translation) of whatever is being represented by the hidden layer.
Basically, this is a somewhat long-winded way to say that it sounds like the approach you describe is good for your problem -- use a nonlinear activation for the hidden layer, and a linear one for the output.
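To connect this to your two specific questions: yes, the identity $f(x) = x$ is the standard choice for a linear output, and there's no need to hand-tune $f(x) = ax + b$, because the affine transformation $Vh + c$ already learns exactly that scale and offset. The identity's derivative is the constant 1, so plugging 1 into your delta computation is correct. A minimal sketch of how the two activations and their derivatives might be paired (names are hypothetical):

```cpp
#include <cmath>

// Hidden layer: nonlinear activation and its derivative.
// Note s'(z) = s(z) * (1 - s(z)) for the logistic sigmoid.
double hidden_act(double z)   { return 1.0 / (1.0 + std::exp(-z)); }
double hidden_deriv(double z) { double a = hidden_act(z); return a * (1.0 - a); }

// Output layer: identity activation; its derivative is the constant 1,
// which is why the output deltas use a factor of exactly 1.
double output_act(double z)          { return z; }
double output_deriv(double /*z*/)    { return 1.0; }
```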
Backpropagation is the most widely used method to optimize the parameters of a neural network. Basically, backprop is gradient descent; to use it, we need to formulate a loss that's a function of the parameters in our model ($W$, $b$, $V$, and $c$).
For regression, typically the loss that's used is the mean squared error (MSE):
$$L(W, b, V, c) = \frac{1}{n} \sum_{i=1}^{n} \left( y(X_i) - t_i \right)^2$$
Here, I've assumed that we have access to a training dataset consisting of $n$ inputs $X_i$ and corresponding target values $t_i$. The network output $y$ is computed as a function of its input $X_i$, and the result is compared with $t_i$ -- any difference is squared and accumulated into the overall loss.
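A minimal C++ sketch of this loss, assuming predictions have already been computed by a forward pass (vector-valued outputs, matching your two-output setup; all names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Mean squared error over a dataset: L = (1/n) * sum_i (y(X_i) - t_i)^2,
// with the squared difference summed over each example's output dimensions.
double mse(const std::vector<std::vector<double>>& predictions,
           const std::vector<std::vector<double>>& targets) {
    double total = 0.0;
    const std::size_t n = predictions.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < predictions[i].size(); ++k) {
            const double diff = predictions[i][k] - targets[i][k];
            total += diff * diff;
        }
    return total / static_cast<double>(n);
}
```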
To optimize the parameters of the model, we need the derivative of the loss with respect to each parameter; gradient descent then nudges the parameters in the direction that decreases the loss. Taking the derivative of this loss with respect to $W$ gets us something like:
$$\frac{\partial L}{\partial W} = \frac{1}{n} \sum_{i=1}^{n} 2 \left( y(X_i) - t_i \right) y'(X_i)$$
Here we've used the chain rule to expand the derivative of the loss to include the derivative of the network output as well. This process of expansion is continued all the way "backward" through the model until the chain rule expansion cannot be applied any further.
Here is where you start to see the application of the derivative of the output function, however. For a regression model, $y(x) = Vh(x) + c$, so $y'(x) = V h'(x)$. So:
$$\frac{\partial L}{\partial W} = \frac{1}{n} \sum_{i=1}^{n} 2 \left( y(X_i) - t_i \right) V h'(X_i)$$
But $h(x) = s(Wx + b)$, so $h'(x) = x \, s'(Wx + b)$ (remember that here we're taking the derivative with respect to $W$).
At any rate, taking all the derivatives gets rather complicated, as you can see for just a two-layer network (or a network with one hidden layer), but the derivatives of the activation functions are just a natural consequence of applying the chain rule while differentiating the overall loss for the model.
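For completeness, here is a hedged sketch of one stochastic-gradient backprop step for this two-layer regression network, with hypothetical names throughout; note the hidden layer uses $s'(z) = s(z)(1 - s(z))$ while the linear output contributes the constant factor 1 -- exactly the "1" you were unsure about:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// One SGD update on a single example (x, t) for the network
//   h = s(W x + b),  y = V h + c,  loss = (y - t)^2 summed over outputs.
void backprop_step(std::vector<std::vector<double>>& W, std::vector<double>& b,
                   std::vector<std::vector<double>>& V, std::vector<double>& c,
                   const std::vector<double>& x, const std::vector<double>& t,
                   double lr) {
    // Forward pass, caching the hidden activations for the backward pass.
    std::vector<double> h(W.size());
    for (std::size_t j = 0; j < W.size(); ++j) {
        double z = b[j];
        for (std::size_t i = 0; i < x.size(); ++i) z += W[j][i] * x[i];
        h[j] = sigmoid(z);
    }
    std::vector<double> y(V.size());
    for (std::size_t k = 0; k < V.size(); ++k) {
        double z = c[k];
        for (std::size_t j = 0; j < h.size(); ++j) z += V[k][j] * h[j];
        y[k] = z;                            // linear output activation
    }

    // Output deltas: dL/dz_k = 2 (y_k - t_k) * r'(z_k), and r'(z) = 1
    // for the identity activation -- the constant factor of 1.
    std::vector<double> delta_out(y.size());
    for (std::size_t k = 0; k < y.size(); ++k)
        delta_out[k] = 2.0 * (y[k] - t[k]) * 1.0;

    // Hidden deltas: chain rule back through V, then s'(z) = h (1 - h).
    std::vector<double> delta_hid(h.size());
    for (std::size_t j = 0; j < h.size(); ++j) {
        double back = 0.0;
        for (std::size_t k = 0; k < y.size(); ++k) back += V[k][j] * delta_out[k];
        delta_hid[j] = back * h[j] * (1.0 - h[j]);
    }

    // Gradient-descent updates for V, c, W, b.
    for (std::size_t k = 0; k < y.size(); ++k) {
        for (std::size_t j = 0; j < h.size(); ++j) V[k][j] -= lr * delta_out[k] * h[j];
        c[k] -= lr * delta_out[k];
    }
    for (std::size_t j = 0; j < h.size(); ++j) {
        for (std::size_t i = 0; i < x.size(); ++i) W[j][i] -= lr * delta_hid[j] * x[i];
        b[j] -= lr * delta_hid[j];
    }
}
```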