I read this good article about the Proximal Policy Optimization algorithm, and now I want to update my VanillaPG agent to a PPO agent to learn more about it. However, I'm still not sure how to implement this in real code, especially since I'm using a simple discrete action space.
So what I do with my VPG Agent is, if there are 3 actions, the network outputs 3 values (out), on which I use softmax (p) and use the result as a distribution to choose one of the actions. For training, I take the states, actions and advantages and use this loss function:
loss = -tf.reduce_sum(advantages * tf.log(ch_action_p_values))
How can I extend this algorithm to use PPO for discrete actions? All of the implementations I found work with continuous action spaces. I'm not sure if I have to change my loss function to the first one used in the article. And I'm not even sure for which probabilities I have to calculate the KLD. Are prob_s_a_* and D_KL single values for the whole batch, or one value for each sample? How can I calculate them in TF for my agent?
You should be able to do it with discrete actions as well without any problem (I never tried, though). The probabilities prob_s_a_* you are talking about are the probabilities of drawing the sampled actions with the current policy (one value per sample).
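To make "one value per sample" concrete, here is a minimal plain-Python sketch (no TensorFlow) of picking the probability of each sampled action out of the softmax output. The helper names `softmax` and `chosen_action_probs` and the toy numbers are assumptions for illustration, not part of the original agent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def chosen_action_probs(batch_logits, actions):
    """Return one probability per sample: p(action_i | state_i)."""
    return [softmax(logits)[a] for logits, a in zip(batch_logits, actions)]

# Toy batch: 2 states, 3 actions each (hypothetical network outputs).
batch_logits = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
actions = [2, 0]  # the actions that were actually sampled

probs = chosen_action_probs(batch_logits, actions)
log_probs = [math.log(p) for p in probs]  # one log probability per sample
```

In TensorFlow the same selection is commonly done by multiplying the softmax output with `tf.one_hot(actions, n_actions)` and summing over the action axis, which is presumably how `ch_action_p_values` is obtained.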
The clipped version of PPO does not use D_KL (the KL divergence): in the paper's experiments the KL-penalty variant performed worse, so they just clip the probability ratio.
So you just need to add a placeholder for the old log probs and clip the ratio between the new and the old probabilities, which you get by exponentiating the difference between the new log probs (tf.log(ch_action_p_values)) and the old ones.
Here is an example (e_clip is the clipping value; in the paper they use 0.2):
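# Original vanilla policy-gradient loss, kept for comparison
vanilla_loss = -tf.reduce_sum(advantages * tf.log(ch_action_p_values))
# Log probs of the sampled actions under the policy that collected the batch
old_log_probs = tf.placeholder(...)
log_probs = tf.log(ch_action_p_values)
# Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space
prob_ratio = tf.exp(log_probs - old_log_probs)
clip_prob = tf.clip_by_value(prob_ratio, 1.-e_clip, 1.+e_clip)
# Clipped surrogate objective, negated so that it can be minimized
ppo_loss = -tf.reduce_mean(tf.minimum(tf.multiply(prob_ratio, advantages), tf.multiply(clip_prob, advantages)))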
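To see what the clipping does numerically, the following plain-Python sketch mirrors the `ppo_loss` expression sample by sample. The function name `ppo_clip_loss` and the toy probabilities are made up for illustration; it is only meant to show the arithmetic, not to replace the TF graph.

```python
import math

def ppo_clip_loss(log_probs, old_log_probs, advantages, e_clip=0.2):
    """Mean clipped surrogate loss over a batch (mirrors ppo_loss above)."""
    terms = []
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)                           # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - e_clip), 1.0 + e_clip)   # clip to [0.8, 1.2]
        terms.append(min(ratio * adv, clipped * adv))           # pessimistic bound
    return -sum(terms) / len(terms)

# Sample 1: ratio 1.5 with positive advantage, so the term is capped at 1.2 * adv.
# Sample 2: ratio 0.5 with negative advantage, so the term is capped at 0.8 * adv.
old_lp = [math.log(0.2), math.log(0.4)]
new_lp = [math.log(0.3), math.log(0.2)]
adv = [1.0, -1.0]

loss = ppo_clip_loss(new_lp, old_lp, adv)       # (1.2 - 0.8) / 2, negated
loss_same = ppo_clip_loss(old_lp, old_lp, adv)  # ratio is 1 everywhere
```

When the new and old log probs are identical, every ratio is 1 and the loss reduces to the negated mean advantage; the clipping only matters once the policy has moved away from the one that collected the data.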
Besides the usual advantages and ch_action_p_values, you need to feed the loss with old_log_probs, computed as the log probability of the sampled actions under the policy that was current when they were sampled, i.e. before any update on that batch.
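The point of keeping `old_log_probs` fixed is that the same batch can then be reused for several update epochs. Below is a rough, framework-free sketch of that idea for a toy single-state softmax policy, with a hand-written gradient standing in for the TF optimizer. The function `ppo_epochs`, the learning rate, and the number of epochs are all assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ppo_epochs(logits, actions, advantages, e_clip=0.2, lr=0.5, n_epochs=4):
    """Several update epochs on one batch for a single-state softmax policy."""
    # Snapshot taken once, before any update: this is what is fed as old_log_probs.
    old_probs = softmax(logits)
    old_log_probs = [math.log(old_probs[a]) for a in actions]
    ratios_per_epoch = []
    for _ in range(n_epochs):
        probs = softmax(logits)
        grad = [0.0] * len(logits)
        ratios = []
        for a, adv, old_lp in zip(actions, advantages, old_log_probs):
            ratio = math.exp(math.log(probs[a]) - old_lp)
            ratios.append(ratio)
            clipped = min(max(ratio, 1.0 - e_clip), 1.0 + e_clip)
            # Gradient flows only while the unclipped term is the minimum.
            if ratio * adv <= clipped * adv:
                for k in range(len(logits)):
                    dlogp = (1.0 if k == a else 0.0) - probs[k]
                    grad[k] += adv * ratio * dlogp / len(actions)
        # Ascent on the surrogate objective, i.e. descent on ppo_loss.
        logits = [x + lr * g for x, g in zip(logits, grad)]
        ratios_per_epoch.append(ratios)
    return logits, ratios_per_epoch

new_logits, ratios = ppo_epochs([0.0, 0.0, 0.0],
                                actions=[0, 1, 2],
                                advantages=[1.0, 0.0, -1.0])
```

In the first epoch every ratio is 1, because the policy has not changed yet. In the TF version, this would correspond to evaluating `log_probs` on the batch once with `sess.run` before training and feeding that array into the `old_log_probs` placeholder for every subsequent update on the batch.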