Tags: reinforcement-learning, policy-gradient-descent

How to solve the zero probability problem in the policy gradient?


Recently, I tried to apply the naive policy gradient method to my problem. However, I found that the differences between the outputs of the last layer of the neural network are huge, which means that after applying the softmax layer, one action is assigned a probability of essentially 1 while all other actions get essentially 0. For instance, the output of the last layer is shown below:

[ 242.9629, -115.6593,   63.3984,  226.1815,  131.5903, -316.6087,
 -205.9341,   98.7216,  136.7644,  266.8708,   19.2289,   47.7531]

After applying the softmax function, it is clear that only one action will be chosen.

[4.1395e-11, 0.0000e+00, 0.0000e+00, 2.1323e-18, 0.0000e+00, 0.0000e+00,
 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00]
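
For reference, the collapse can be reproduced with a few lines (assuming PyTorch, since the printout above is a torch tensor):

import torch

# Raw logits copied from above; their scale (hundreds) is so large that
# softmax saturates and puts essentially all probability on one action.
logits = torch.tensor([242.9629, -115.6593,   63.3984,  226.1815,  131.5903, -316.6087,
                       -205.9341,   98.7216,  136.7644,  266.8708,   19.2289,   47.7531])

probs = torch.softmax(logits, dim=0)
print(probs)           # effectively one-hot: index 9 gets probability ~1.0
print(probs.argmax())  # tensor(9)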

This problem severely hurts the final performance, because after only a few steps the network keeps choosing the same single action. Is there any way to solve this problem?

(By the way, even when I give the network negative rewards, the actions it chooses remain unchanged.)

My training curve is shown below: [figure: training curve]


Solution

  • In fact, there is no deterministic way to solve this problem, as it is an instance of the age-old exploration-exploitation dilemma from the optimization domain. In reinforcement learning specifically, there are two simple ways to mitigate it:

    1. Firstly, reducing the learning rate is the simplest fix. With a lower learning rate the logits grow more slowly, so the policy network keeps exploring different actions for longer and is less likely to get stuck in a local optimum.
    2. Secondly, adding a policy entropy term to the loss function is another way to encourage exploration; a good example of this idea is the Soft Actor-Critic (SAC) algorithm (a rough sketch appears below).

    Both methods have been validated on my task, and both effectively alleviate the premature-convergence problem. However, each introduces a hyperparameter (the learning rate or the entropy coefficient) that has to be tuned by hand, which increases the complexity of my algorithm.
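
    As a rough sketch of the second idea, an entropy bonus can be added to a plain REINFORCE-style loss as below (assuming a PyTorch policy network; policy_net, entropy_coef, and the exact loss form are illustrative, not the setup from the question):

        import torch

        def policy_loss(policy_net, states, actions, returns, entropy_coef=0.01):
            """REINFORCE loss with an entropy bonus that penalizes a collapsed policy."""
            logits = policy_net(states)                            # (batch, n_actions)
            dist = torch.distributions.Categorical(logits=logits)
            log_probs = dist.log_prob(actions)                     # log pi(a|s)
            pg_loss = -(log_probs * returns).mean()                # vanilla policy-gradient term
            entropy_bonus = dist.entropy().mean()                  # large when probabilities are spread out
            return pg_loss - entropy_coef * entropy_bonus          # entropy_coef is the hand-tuned knob

    The larger entropy_coef is, the more strongly the optimizer is pushed away from near-deterministic softmax outputs like the ones shown in the question.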

    By the way, similar to Q-learning, we can also use an epsilon-greedy mechanism to push the agent to explore more actions. However, this is not an elegant solution here, because it is hard to choose a good epsilon value.
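
    For completeness, epsilon-greedy action selection could look like the following rough sketch (policy_net, epsilon, and n_actions=12, which matches the 12 logits above, are illustrative):

        import random
        import torch

        def select_action(policy_net, state, epsilon=0.1, n_actions=12):
            """Mostly follow the policy's preferred action, but explore uniformly with probability epsilon."""
            if random.random() < epsilon:
                return random.randrange(n_actions)      # explore: uniform random action
            with torch.no_grad():
                logits = policy_net(state)
                return int(torch.argmax(logits))        # exploit: action with the largest logit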