python, machine-learning, deep-learning, pytorch, activation-function

What is the best choice of activation function for small-sized neural networks?


I am using PyTorch and autograd to build my neural network architecture. It is a small three-layer network with a single input and a single output. I need to predict an output function from some initial conditions, and I am using a custom loss function.

The problems I am facing are:

  1. My loss converges initially but gradients vanish eventually.

  2. I have tried sigmoid activation and tanh. tanh gives slightly better results in terms of loss convergence.

  3. I tried using ReLU, but since my network does not have many weights, the neurons die and it doesn't give good results.

Is there any other activation function, apart from sigmoid and tanh, that handles the vanishing gradient problem well enough for small-sized neural networks? Any suggestions on what else I can try?


Solution

  • In the deep learning world, ReLU is usually preferred over other activation functions because it mitigates the vanishing gradient problem, allowing models to learn faster and perform better. It has downsides, however.
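
    To make the vanishing-gradient point concrete, here is a minimal sketch in plain Python (no PyTorch needed) that compares the local derivatives of sigmoid, tanh, and ReLU. The helper names are illustrative, and the three-layer product ignores the weight matrices, so treat it as a rough illustration rather than an exact gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # Derivative of sigmoid; its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    # Derivative of tanh; its maximum is 1.0 at x = 0.
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    # Derivative of ReLU; exactly 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained(derivative, x, layers=3):
    # Product of the same local derivative across `layers` layers
    # (assumption: weights are ignored to isolate the activation's effect).
    return derivative(x) ** layers

for x in (0.5, 2.0, 5.0):
    print(x, chained(d_sigmoid, x), chained(d_tanh, x), chained(d_relu, x))
```

    The sigmoid and tanh columns shrink quickly as the input moves away from zero, while the ReLU column stays at 1 for any positive input.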

    Dying ReLU problem

    The dying ReLU problem refers to the scenario in which a large number of ReLU neurons output 0 for every input. Because the derivative of ReLU is zero for negative inputs, no gradient flows through these neurons during backpropagation and their weights do not get updated. Ultimately a large part of the network becomes inactive and is unable to learn further.
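
    As a rough diagnostic, one can count how many hidden units output zero for every input. The sketch below does this in plain Python for a single-input layer; `dead_fraction` and the weights, biases, and inputs are all hypothetical values chosen for illustration.

```python
def relu(z):
    return max(0.0, z)

def dead_fraction(weights, biases, inputs):
    """Fraction of ReLU units that output 0 for every value in `inputs`."""
    dead = 0
    for w, b in zip(weights, biases):
        if all(relu(w * x + b) == 0.0 for x in inputs):
            dead += 1
    return dead / len(weights)

inputs = [0.1 * i for i in range(11)]   # hypothetical scalar inputs in [0, 1]
weights = [0.8, -0.5, 1.2, -1.0]        # hypothetical hidden-layer weights
biases = [0.1, -0.2, -3.0, 0.4]         # hypothetical hidden-layer biases

print(dead_fraction(weights, biases, inputs))
```

    With these values the second and third units never activate on the given inputs, so half of the layer counts as dead.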

    What causes the Dying ReLU problem?

    • High learning rate: If the learning rate is set too high, a single large update can push a neuron's weights and bias so far into the negative range that its pre-activation is negative for every input.
    • Large negative bias: A large negative bias term makes the input to the ReLU activation negative, so the neuron outputs zero.
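
    The learning-rate cause can be reproduced with a single ReLU neuron and one gradient step. This is a minimal plain-Python sketch; the starting weight, bias, targets, and the two learning rates are made-up values, and the loss is a mean squared error standing in for whatever loss is actually used.

```python
def relu(z):
    return max(0.0, z)

def grads(w, b, xs, ts):
    """Mean-squared-error gradients (dL/dw, dL/db) for y = relu(w*x + b)."""
    n = len(xs)
    dw = db = 0.0
    for x, t in zip(xs, ts):
        z = w * x + b
        active = 1.0 if z > 0 else 0.0      # derivative of ReLU at z
        err = 2.0 * (relu(z) - t) * active  # dL/dz for this sample
        dw += err * x / n
        db += err / n
    return dw, db

def sgd_step(w, b, xs, ts, lr):
    dw, db = grads(w, b, xs, ts)
    return w - lr * dw, b - lr * db

xs = [0.1 * i for i in range(11)]   # hypothetical inputs in [0, 1]
ts = [0.2 * x for x in xs]          # hypothetical targets

for lr in (0.01, 2.0):
    w, b = sgd_step(1.0, 0.5, xs, ts, lr)
    active_inputs = sum(1 for x in xs if w * x + b > 0)
    print(f"lr={lr}: w={w:.3f}, b={b:.3f}, "
          f"active inputs={active_inputs}/{len(xs)}, "
          f"next grads={grads(w, b, xs, ts)}")
```

    After the step with the large learning rate, the pre-activation is negative for every input, so the next gradient is zero and the neuron cannot recover through gradient descent alone.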

    How to solve the Dying ReLU problem?

    • Use a smaller learning rate: Lowering the learning rate, or decaying it during training, reduces the chance of an update large enough to kill neurons.

    • Variations of ReLU: Leaky ReLU is a common and effective way to address the dying ReLU problem. It adds a slight slope in the negative range, so the gradient there is small but never exactly zero. Other variations include PReLU, ELU, and GELU. If you want to dig deeper, check out this link.

    • Modify the initialization procedure: It has been demonstrated that a randomized asymmetric initialization can help prevent the dying ReLU problem. Check out the arXiv paper for the mathematical details.
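
    The first two suggestions could look roughly like this in PyTorch. This is a sketch under stated assumptions, not a tuned setup: `make_net`, the hidden width of 8, the sine target, the `MSELoss` stand-in for the custom loss, and the scheduler settings are all placeholders. The randomized asymmetric initialization from the paper is not implemented here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_net(activation, hidden=8):
    # Hypothetical small network: one input, one output, as in the question.
    return nn.Sequential(
        nn.Linear(1, hidden), activation(),
        nn.Linear(hidden, hidden), activation(),
        nn.Linear(hidden, 1),
    )

# Swap in a ReLU variant; nn.ELU, nn.GELU, or nn.PReLU can be passed the same way.
net = make_net(nn.LeakyReLU)

optimizer = torch.optim.SGD(net.parameters(), lr=0.05)
# Halve the learning rate every 100 steps (placeholder schedule).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)   # placeholder inputs
y = torch.sin(3.0 * x)                           # placeholder target function
loss_fn = nn.MSELoss()                           # stand-in for the custom loss

for step in range(300):
    optimizer.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()

print(loss.item(), scheduler.get_last_lr())
```

    Passing the activation class to `make_net` keeps it easy to compare variants on the same data; which one works best for a given small network is an empirical question.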

    Sources:

    Practical guide for ReLU

    ReLU variants

    Dying ReLU problem