machine-learning · neural-network · nonlinear-functions

Neural network (non) linearity


I am somewhat confused by the use of the term linear/non-linear when discussing neural networks. Can anyone clarify these 3 points for me:

  1. Each node in a neural net computes a weighted sum of its inputs. This is a linear combination of inputs, so the value for each node (ignoring activation) is given by some linear function. I hear that neural nets are universal function approximators. Does this mean that, despite containing linear functions within each node, the total network is able to approximate a non-linear function as well? Are there any clear examples of how this works in practice? (I sketch my mental model of a single node after this list.)
  2. An activation function is applied to the output of that node to squash/transform the output for further propagation through the rest of the network. Am I correct in interpreting this output from the activation function as the "strength" of that node?
  3. Activation functions are also referred to as non-linear functions. Where does the term non-linear come from, given that the input to the activation is just a linear combination of the node's inputs? I assume it refers to the idea that something like the sigmoid function is itself a non-linear function? Why does it matter that the activation is non-linear?
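To make question 1 concrete, here is roughly what I picture a single node doing, as a minimal NumPy sketch (the names x, w, b and the values are just illustrative): a weighted sum of the inputs, followed by an activation applied to that sum.

```python
import numpy as np

# Inputs, weights and bias for one node (illustrative values).
x = np.array([0.5, -1.2, 3.0])   # inputs to the node
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # the linear part: a weighted sum of inputs
a = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation applied to that sum

print(z, a)
```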

Solution

  • 1 Linearity

    A neural network is only non-linear if you squash the output signal from the nodes with a non-linear activation function. A complete neural network (with non-linear activation functions) is a universal function approximator.

    Bonus: It should be noted that if you use linear activation functions in multiple consecutive layers, you could just as well prune them down to a single layer, because the composition of linear maps is again linear (the single layer's weights simply become the product of the original weight matrices). A network with multiple layers using linear activation functions is therefore not able to model more complicated functions than a network with a single layer.
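    A quick NumPy sketch of that collapse (ignoring biases for brevity, and with made-up layer sizes): stacking two purely linear layers gives exactly the same function as one layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two consecutive layers with purely linear (identity) activations.
W1 = rng.standard_normal((4, 3))   # first layer: 3 inputs -> 4 units
W2 = rng.standard_normal((2, 4))   # second layer: 4 units -> 2 outputs

x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # applying the layers one after the other
collapsed = (W2 @ W1) @ x          # a single equivalent linear layer

print(np.allclose(two_layers, collapsed))  # True: stacking gained nothing
```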

    2 Activation signal

    The squashed output signal can indeed be interpreted as the strength of that node's signal (biologically speaking), though it might be incorrect to interpret that strength as an equivalent of confidence, as in fuzzy logic.

    3 Non-linear activation functions

    Yes, you are spot on. The input signals combined with their respective weights form a linear combination. The non-linearity comes from your choice of activation functions. Remember that a linear function can be drawn as a straight line - sigmoid, tanh, ReLU and so on cannot be drawn as a single straight line.
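    One way to check this, as a small sketch: a linear function f would satisfy f(a + b) = f(a) + f(b) for all inputs, and the common activation functions do not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# A linear function f would satisfy f(a + b) == f(a) + f(b).
a, b = -1.0, 2.0
for name, f in [("sigmoid", sigmoid), ("tanh", np.tanh), ("relu", relu)]:
    print(name, np.isclose(f(a + b), f(a) + f(b)))  # False for all three
```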

    Why do we need non-linear activation functions?

    Most functions and classification tasks are probably best described by non-linear functions. If we decided to use linear activation functions, we would end up with a much coarser approximation of a complex function.
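    A small illustration of that coarseness (a sketch, not part of the original answer): for XOR, the best a purely linear model can do is predict 0.5 everywhere, while a tiny network with one non-linear hidden layer represents it exactly. The hidden weights below are hand-picked for the example.

```python
import numpy as np

# XOR: the classic target that no linear model can represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best possible linear fit (least squares, with a bias column).
Xb = np.hstack([X, np.ones((4, 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(Xb @ w)                      # ~[0.5, 0.5, 0.5, 0.5]: a coarse approximation

# A tiny network with one hidden ReLU layer (hand-picked weights) gets XOR exactly.
relu = lambda z: np.maximum(0.0, z)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])
print(relu(X @ W1 + b1) @ w2)      # [0, 1, 1, 0]
```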

    Universal approximators

    You can sometimes read in papers that neural networks are universal approximators. This implies that a "perfect" network could be fitted to any model/function you could throw at it, though configuring that perfect network (the number of nodes, the number of layers, and so on) is a non-trivial task.
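    A minimal sketch of the idea (not a rigorous demonstration): a single hidden layer of tanh units, here with random hidden weights and only the output layer fitted by least squares, already approximates a clearly non-linear target such as sin(x) quite closely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a clearly non-linear function on [-pi, pi].
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel()

# One hidden layer of 100 tanh units with random weights; only the output
# layer is fitted. Even this crude setup tracks sin(x) closely.
W1 = rng.standard_normal((1, 100)) * 2.0
b1 = rng.standard_normal(100)
H = np.tanh(x @ W1 + b1)

w2, *_ = np.linalg.lstsq(H, y, rcond=None)
pred = H @ w2

print("max abs error:", np.max(np.abs(pred - y)))  # typically very small
```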

    Read more about the implications on the Wikipedia page for the universal approximation theorem.