Tags: machine-learning, deep-learning, lstm, recurrent-neural-network, activation-function

What is the intuition behind using tanh in LSTM?


In an LSTM network (Understanding LSTMs), why do the input gate and output gate use tanh?

What is the intuition behind this?

Is it just a nonlinear transformation? If so, can I change both to another activation function (e.g., ReLU)?


Solution

  • Sigmoid, specifically, is used as the gating function for the three gates (input, output, and forget) in an LSTM, since it outputs a value between 0 and 1 and can therefore either block the flow of information completely or let it pass through the gate in full.

    On the other hand, to overcome the vanishing gradient problem, we need a function whose second derivative can sustain over a long range before going to zero. Tanh is a good function with this property.
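    To make this division of labour concrete, below is a minimal numpy sketch of a single LSTM cell step (the weight names, shapes, and stacking order are illustrative assumptions, not any particular library's layout): sigmoid squashes the gate pre-activations into [0, 1] so they act as soft switches, while tanh keeps the candidate values and the emitted hidden state in [-1, 1].

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # W, U, b stack the parameters for the four pre-activations
        # (input gate, forget gate, output gate, candidate), split evenly.
        z = W @ x + U @ h_prev + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates in [0, 1]
        g = np.tanh(g)                                # candidate in [-1, 1]
        c = f * c_prev + i * g                        # new cell state
        h = o * np.tanh(c)                            # new hidden state in [-1, 1]
        return h, c
    ```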

    A good neuron unit should be bounded, easily differentiable, monotonic (good for convex optimization), and cheap to compute. If you consider these qualities, then I believe you can use ReLU in place of the tanh function, since the two are good alternatives to each other.
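    If you want to try that swap in practice, note that (assuming you are working with tf.keras; other frameworks expose similar options) the LSTM layer keeps the two nonlinearities separate: `activation` is the tanh applied to the candidate and output transform, while `recurrent_activation` is the sigmoid used for the gates, so replacing tanh with ReLU is a one-argument change:

    ```python
    import tensorflow as tf

    # Default LSTM: tanh for the candidate/output transform, sigmoid for the gates.
    default_lstm = tf.keras.layers.LSTM(64)

    # Same layer with ReLU in place of tanh; the gates stay sigmoid,
    # so they still output values in [0, 1].
    relu_lstm = tf.keras.layers.LSTM(64, activation="relu")
    ```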

    But before choosing an activation function, you must know the advantages and disadvantages of your choice over the others. Below I briefly describe some common activation functions and their advantages.

    Sigmoid

    Mathematical expression: sigmoid(z) = 1 / (1 + exp(-z))

    First-order derivative: sigmoid'(z) = exp(-z) / (1 + exp(-z))^2

    Advantages:

    (1) The sigmoid function has all the fundamental properties of a good activation function listed above: it is bounded, easily differentiable, and monotonic
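    A quick numpy check of the expression and its derivative (the finite-difference comparison is only a sanity check, not part of any training code):

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        # exp(-z) / (1 + exp(-z))^2, which equals sigmoid(z) * (1 - sigmoid(z))
        return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

    z, eps = 0.5, 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(np.isclose(sigmoid_prime(z), numeric))  # True
    ```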
    

    Tanh

    Mathematical expression: tanh(z) = [exp(z) - exp(-z)] / [exp(z) + exp(-z)]

    First-order derivative: tanh'(z) = 1 - ([exp(z) - exp(-z)] / [exp(z) + exp(-z)])^2 = 1 - tanh^2(z)

    Advantages:

    (1) Often found to converge faster in practice than sigmoid (its output is zero-centered)
    (2) Gradient computation is less expensive
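    The second point follows from the identity above: once tanh(z) is known from the forward pass, the gradient is just 1 - tanh^2(z). A small numpy sketch to confirm it:

    ```python
    import numpy as np

    z = np.linspace(-3.0, 3.0, 7)
    analytic = 1.0 - np.tanh(z) ** 2                             # reuses tanh(z) itself
    eps = 1e-6
    numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)  # finite differences
    print(np.allclose(analytic, numeric))                        # True
    ```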
    

    Hard Tanh

    Mathematical expression: hardtanh(z) = -1 if z < -1; z if -1 <= z <= 1; 1 if z > 1

    First-order derivative: hardtanh'(z) = 1 if -1 <= z <= 1; 0 otherwise

    Advantages:

    (1) Computationally cheaper than tanh
    (2) Saturates for magnitudes of z greater than 1
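    numpy has no built-in hard tanh, but it is one clip away (a minimal sketch, with the derivative taken as 1 on the closed interval [-1, 1]):

    ```python
    import numpy as np

    def hardtanh(z):
        return np.clip(z, -1.0, 1.0)   # -1 below -1, z in between, 1 above 1

    def hardtanh_prime(z):
        return np.where((z >= -1.0) & (z <= 1.0), 1.0, 0.0)

    z = np.array([-2.0, 0.3, 2.0])
    print(hardtanh(z))        # [-1.   0.3  1. ]
    print(hardtanh_prime(z))  # [0. 1. 0.]
    ```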
    

    ReLU

    Mathematical expression: relu(z) = max(z, 0)

    First-order derivative: relu'(z) = 1 if z > 0; 0 otherwise

    Advantages:

    (1) Does not saturate even for large values of z
    (2) Found much success in computer vision applications
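    ReLU and its derivative in numpy (the derivative at exactly z = 0 is taken as 0 here by convention):

    ```python
    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def relu_prime(z):
        return np.where(z > 0, 1.0, 0.0)

    z = np.array([-1.5, 0.0, 2.0])
    print(relu(z))        # [0. 0. 2.]
    print(relu_prime(z))  # [0. 0. 1.]
    ```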
    

    Leaky ReLU

    Mathematical expression: leaky(z) = max(z, k*z), where 0 < k < 1

    First-order derivative: leaky'(z) = 1 if z > 0; k otherwise

    Advantages:

    (1) Allows error to propagate for non-positive z, which ReLU does not
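    The same in numpy, with the leak slope k given a hypothetical default of 0.01 (any value with 0 < k < 1 works):

    ```python
    import numpy as np

    def leaky_relu(z, k=0.01):
        return np.maximum(z, k * z)    # k*z for negative z, z otherwise

    def leaky_relu_prime(z, k=0.01):
        return np.where(z > 0, 1.0, k)

    z = np.array([-2.0, 3.0])
    print(leaky_relu(z))        # [-0.02  3.  ]
    print(leaky_relu_prime(z))  # [0.01 1.  ]
    ```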
    

    This paper describes some further activation functions; you may want to read it.