Tags: machine-learning, reinforcement-learning, q-learning, activation-function

RL Activation Functions with Negative Rewards


I have a question regarding appropriate activation functions for environments that have both positive and negative rewards.

In reinforcement learning, the output, I believe, should be the expected return for each possible action. Since some actions yield a negative reward, we would want an output range that includes negative numbers.

This leads me to believe that the only appropriate output activation functions would be either linear or tanh. However, I see ReLU used in many RL papers.

So two questions:

  1. If you do want to have both negative and positive outputs, are you limited to just tanh and linear?

  2. Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (e.g. use [0, 1, 2] instead of [-1, 0, 1]) so that the model can leverage alternative activation functions?


Solution

  • Many RL papers indeed use ReLUs for most layers, but typically not for the final output layer. You mentioned the Human Level Control through Deep Reinforcement Learning paper and the Hindsight Experience Replay paper in one of the comments, but neither of those papers describes an architecture that uses ReLUs for the output layer.

    In the Human Level Control through Deep RL paper, on page 6 (after the references), the last paragraph of the "Model architecture" part of the "Methods" section mentions that the output layer is a fully connected linear layer (not a ReLU). So, indeed, all hidden layers can only have non-negative activation levels (since they all use ReLUs), but the output layer can still take negative values if there are negative weights between the last hidden layer and the output layer. This is necessary because its outputs are interpreted as Q-values, which may be negative.
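    For concreteness, here is a minimal sketch of that pattern in PyTorch (my choice of library; the layer sizes below are placeholders rather than the paper's actual convolutional architecture). The hidden layers use ReLU, while the output head is a plain fully connected layer, so the predicted Q-values can be negative:

        import torch
        import torch.nn as nn

        class QNetwork(nn.Module):
            """Illustrative Q-network: ReLU hidden layers, linear output head."""

            def __init__(self, state_dim: int, num_actions: int):
                super().__init__()
                self.hidden = nn.Sequential(
                    nn.Linear(state_dim, 128),
                    nn.ReLU(),            # hidden activations are >= 0
                    nn.Linear(128, 128),
                    nn.ReLU(),
                )
                # Final layer: fully connected with no activation, so the
                # outputs (interpreted as Q-values) may be negative.
                self.q_head = nn.Linear(128, num_actions)

            def forward(self, state: torch.Tensor) -> torch.Tensor:
                return self.q_head(self.hidden(state))

        net = QNetwork(state_dim=4, num_actions=2)
        print(net(torch.randn(1, 4)))  # may well contain negative entries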

    In the Hindsight Experience Replay paper, they do not use DQN (as the paper above does) but DDPG, which is an "actor-critic" algorithm. The "critic" part of that architecture is also intended to output values that can be negative, similar to the DQN architecture, so it also cannot use a ReLU for the output layer (though it can still use ReLUs everywhere else in the network). Appendix A of that paper, under "Network architecture", also describes the actor's output layer as using tanh as its activation function.
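    In the same spirit, here is a rough sketch of what those two output layers look like (again PyTorch, with made-up layer sizes; the actual sizes and hyperparameters are in the paper's Appendix A). The critic ends in a linear layer, so its Q-value estimate can be negative, while the actor ends in a tanh so its action output stays within a bounded range:

        import torch
        import torch.nn as nn

        class Critic(nn.Module):
            """Q(s, a) estimator: ReLU hidden layer, linear (unbounded) output."""

            def __init__(self, state_dim: int, action_dim: int):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(state_dim + action_dim, 64),
                    nn.ReLU(),
                    nn.Linear(64, 1),     # no activation: Q-value may be negative
                )

            def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
                return self.net(torch.cat([state, action], dim=-1))

        class Actor(nn.Module):
            """Deterministic policy: tanh keeps actions in [-max_action, max_action]."""

            def __init__(self, state_dim: int, action_dim: int, max_action: float = 1.0):
                super().__init__()
                self.max_action = max_action
                self.net = nn.Sequential(
                    nn.Linear(state_dim, 64),
                    nn.ReLU(),
                    nn.Linear(64, action_dim),
                    nn.Tanh(),            # bounded output layer for the actor
                )

            def forward(self, state: torch.Tensor) -> torch.Tensor:
                return self.max_action * self.net(state)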

    To answer your specific questions:

    1. If you do want to have both negative and positive outputs, are you limited to just tanh and linear?

       Well, there are other activations too. A leaky ReLU, for example, can produce negative outputs (and there are probably plenty of others that can), but a plain ReLU indeed cannot, and neither can a sigmoid, whose range is (0, 1). A toy comparison of these activations is sketched after this list.

    2. Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (e.g. use [0, 1, 2] instead of [-1, 0, 1]) so that the model can leverage alternative activation functions?

       Not 100% sure; possibly. It would often be difficult, though, if you have no domain knowledge about how large or small the rewards (and/or returns) can possibly get. I have a feeling it would typically be easier to simply end with one fully connected linear layer.
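    To illustrate point 1, here is a quick toy comparison (not taken from either paper) of which common activations actually pass a negative value through:

        import torch
        import torch.nn.functional as F

        x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

        print("relu:      ", F.relu(x))         # negatives clamped to 0
        print("leaky_relu:", F.leaky_relu(x))   # negatives kept (scaled by 0.01)
        print("tanh:      ", torch.tanh(x))     # range (-1, 1), sign preserved
        print("sigmoid:   ", torch.sigmoid(x))  # range (0, 1), never negative
        print("linear:    ", x)                 # identity output, unrestricted range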