I'm a complete newbie to Reinforcement Learning, and I have a question about the choice of the activation function for the output layer of the keras-rl agents. All the examples provided by keras-rl (https://github.com/matthiasplappert/keras-rl/tree/master/examples) use a linear activation function in the output layer. Why is this? What effect should I expect if I use a different activation function? For example, if I work with an OpenAI environment with a discrete action space of 5, should I consider using softmax in the output layer of the agent? Thanks much in advance.
Some of the agents in keras-rl use a linear activation function even though they work with discrete action spaces (for example, DQN and DDQN). CEM, on the other hand, uses a softmax activation function for discrete action spaces (which is what one would expect). A sketch of the two output heads is shown right below.
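For reference, the output heads in the repository's examples look roughly like the following sketch; the hidden-layer size and input shape here are placeholders, not the exact values from the example scripts:

from keras.models import Sequential
from keras.layers import Dense, Flatten

nb_actions = 5  # e.g. a discrete action space with 5 actions

# DQN/DDQN-style head: linear output, so the network emits raw, unbounded Q-values.
dqn_model = Sequential([
    Flatten(input_shape=(1, 4)),             # placeholder (window_length, observation_dim)
    Dense(16, activation='relu'),
    Dense(nb_actions, activation='linear'),  # Q(s, a) estimates
])

# CEM-style head: softmax output, so the network emits action probabilities directly.
cem_model = Sequential([
    Flatten(input_shape=(1, 4)),
    Dense(16, activation='relu'),
    Dense(nb_actions, activation='softmax'),  # pi(a | s), sums to 1
])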
The reason behind the linear activation function for DQN and DDQN is their exploration policy, which is part of the agent. Taking BoltzmannQPolicy, one of the exploration policies used with these agents, and looking at its select_action method, we see the following:
import numpy as np

from rl.policy import Policy


class BoltzmannQPolicy(Policy):
    def __init__(self, tau=1., clip=(-500., 500.)):
        super(BoltzmannQPolicy, self).__init__()
        self.tau = tau    # temperature: lower values make the choice greedier
        self.clip = clip  # clipping range to avoid overflow in np.exp

    def select_action(self, q_values):
        assert q_values.ndim == 1
        q_values = q_values.astype('float64')
        nb_actions = q_values.shape[0]

        # Softmax over the raw Q-values, scaled by the temperature tau
        exp_values = np.exp(np.clip(q_values / self.tau, self.clip[0], self.clip[1]))
        probs = exp_values / np.sum(exp_values)
        # Sample an action according to the resulting probabilities
        action = np.random.choice(range(nb_actions), p=probs)
        return action
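To see what this does with concrete numbers, here is a small run of the policy defined above; the Q-values are made up and stand in for what a linear output layer would produce for 5 actions:

import numpy as np

np.random.seed(0)

# Hypothetical raw Q-values, as a linear output layer would emit them.
q_values = np.array([1.2, 0.3, -0.5, 2.1, 0.0])

policy = BoltzmannQPolicy(tau=1.)
action = policy.select_action(q_values)
print(action)  # an index in [0, 4], sampled from softmax(q_values / tau)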
At decision time, the output of the linear activation function in the last dense layer (the raw Q-values) is transformed by the Boltzmann exploration policy into probabilities in the range [0, 1], and the concrete action is then sampled from that distribution. Since the policy already applies this temperature-scaled softmax, a softmax is not used in the output layer.
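Putting it together, the wiring in the dqn_cartpole.py example looks roughly like this (trimmed down; exact constructor arguments can differ between keras-rl versions):

import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import BoltzmannQPolicy

env = gym.make('CartPole-v0')
nb_actions = env.action_space.n

# The network only has to produce raw Q-values, hence the linear output layer.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16, activation='relu'))
model.add(Dense(nb_actions, activation='linear'))

memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()  # converts the Q-values to action probabilities at decision time
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=10000, verbose=2)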
You can read more about different exploration strategies and their comparison here: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-7-action-selection-strategies-for-exploration-d3a97b7cceaf