Tags: machine-learning, deep-learning, neural-network

How to decide on activation function?


There are many activation functions available, such as sigmoid, tanh, and ReLU (usually the default choice), but my question is: what considerations should go into selecting a particular activation function?

For example: when we want to upsample in a GAN, LeakyReLU is usually preferred.

My knowledge up to now:
Sigmoid: when you have a binary class to identify
Tanh: ?
ReLU: ?
LeakyReLU: when you want to upsample

Any help or articles?


Solution

  • This is an open research question. The choice of activation function is also heavily intertwined with the model architecture and the compute/resources available, so it cannot be answered in isolation. The paper Efficient BackProp by Yann LeCun et al. has a lot of good insights into what makes a good activation function.

    That being said, here are some toy examples that may help build intuition about activation functions. Consider a simple MLP with one hidden layer and a simple binary classification task:

    [image: the toy classification task and one-hidden-layer MLP setup]

    In the last layer we can use sigmoid together with the binary_crossentropy loss, borrowing the intuition from logistic regression: we are effectively doing simple logistic regression on the learned features that the hidden layer passes to the output layer.
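
    For instance, here is a minimal sketch of that setup in Keras (the make_moons toy data, layer sizes, and training settings are just assumptions for illustration):

        import numpy as np
        from sklearn.datasets import make_moons
        from tensorflow import keras

        # Toy 2D binary classification data (assumed for illustration)
        X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

        model = keras.Sequential([
            # hidden layer: learns the features
            keras.layers.Dense(3, activation="relu", input_shape=(2,)),
            # output layer: logistic regression on those learned features
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(X, y, epochs=200, verbose=0)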

    What types of features are learned depends on the activation function used in that hidden layer and the number of neurons in that hidden layer.

    Here is what ReLU learns when using two hidden neurons:

    https://miro.medium.com/max/2000/1*5nK725uTBUeoIA0XjEyA_A.gif

    (on the left is what the decision boundary looks like in the feature space)

    As you add more neurons you get more pieces with which to approximate the decision boundary. Here it is with 3 hidden neurons:

    [image: decision boundary learned with 3 hidden ReLU neurons]

    And 10 hidden neurons:

    [image: decision boundary learned with 10 hidden ReLU neurons]
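
    If you want to reproduce plots like these yourself, here is a rough sketch using scikit-learn (the make_moons dataset and the plotting grid are assumptions; the original figures may use different data):

        import numpy as np
        import matplotlib.pyplot as plt
        from sklearn.datasets import make_moons
        from sklearn.neural_network import MLPClassifier

        X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
        xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-2, 2, 300))
        grid = np.c_[xx.ravel(), yy.ravel()]

        # Train one-hidden-layer ReLU MLPs of increasing width and plot their decision boundaries
        for i, width in enumerate([2, 3, 10]):
            clf = MLPClassifier(hidden_layer_sizes=(width,), activation="relu",
                                max_iter=5000, random_state=0).fit(X, y)
            zz = clf.predict(grid).reshape(xx.shape)
            ax = plt.subplot(1, 3, i + 1)
            ax.contourf(xx, yy, zz, alpha=0.3)
            ax.scatter(X[:, 0], X[:, 1], c=y, s=8)
            ax.set_title(f"{width} ReLU neurons")
        plt.show()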

    Sigmoid and tanh produce similar decision boundaries (this is tanh: https://miro.medium.com/max/2000/1*jynT0RkGsZFqt3WSFcez4w.gif; sigmoid looks similar), which are more continuous and sinusoidal.

    The main difference is that sigmoid is not zero-centered, which makes it a poor choice for a hidden layer, especially in deep networks.
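
    To see the zero-centering point concretely, here is a quick numerical check (nothing here is specific to any particular library or model):

        import numpy as np

        x = np.linspace(-5, 5, 1001)
        sigmoid = 1 / (1 + np.exp(-x))
        tanh = np.tanh(x)

        # Sigmoid outputs lie in (0, 1) with mean ~0.5, so the activations fed to the
        # next layer are always positive; tanh outputs lie in (-1, 1) and are centered at 0.
        print("sigmoid range:", sigmoid.min(), sigmoid.max(), "mean:", sigmoid.mean())
        print("tanh    range:", tanh.min(), tanh.max(), "mean:", tanh.mean())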