python tensorflow reinforcement-learning

Implement simple PPO Agent in TensorFlow

I read this good article about the Proximal Policy Optimization algorithm, and now I want to update my VanillaPG agent to a PPO agent to learn more about it. However, I'm still not sure how to implement this in real code, especially since I'm using a simple discrete action space.

With my VPG agent, if there are 3 actions, the network outputs 3 values (out), to which I apply softmax (p), and I use the result as a distribution to sample one of the actions. For training, I take the states, actions and advantages and use this loss function:

loss = -tf.reduce_sum(advantages * tf.log(ch_action_p_values))

How can I extend this algorithm to use PPO for discrete actions? All of the implementations I found work with continuous action spaces. I'm not sure if I have to change my loss function to the first one used in the article. And I'm not even sure which probabilities the KLD should be calculated over. Are prob_s_a_* and D_KL single values for the whole batch, or one value for each sample? How can I calculate them in TF for my agent?

Solution

You should be able to do it with a discrete action space without any problem (I never tried, though). The probabilities prob_s_a_* you are talking about are the probabilities of drawing the sampled actions under the current policy (one value per sample). The clipped version of PPO does not use D_KL (the KL divergence): in the paper's experiments the KL-penalty variant performed worse, so they just clip the probability ratio.
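To make "one value per sample" concrete, here is a minimal framework-free sketch in plain Python. The helper names (softmax, chosen_action_log_probs) and the example batch are made up for illustration and are not part of the question's code. In the TF graph, ch_action_p_values presumably plays the role of the selected probabilities.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def chosen_action_log_probs(batch_logits, actions):
    # One log probability per sample: log pi(a_t | s_t) for the action
    # that was actually taken in state s_t.
    return [math.log(softmax(logits)[a])
            for logits, a in zip(batch_logits, actions)]

# Hypothetical batch: 2 states, 3 actions each, with the sampled action indices.
batch_logits = [[1.0, 2.0, 0.5], [0.0, 0.0, 0.0]]
actions = [1, 2]
log_probs = chosen_action_log_probs(batch_logits, actions)
```

The result has the same length as the batch, so the probability ratio and the clipped term are also computed per sample and only averaged at the very end.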

So you just need to add a placeholder for the old log probabilities and clip the probability ratio, which is the exponential of the difference between the new log probabilities (tf.log(ch_action_p_values)) and the old ones.

Here is an example (e_clip is the clipping value; in the paper they use 0.2):

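    # Original vanilla policy-gradient loss, kept for comparison.
    vanilla_loss = -tf.reduce_sum(advantages * tf.log(ch_action_p_values))

    # Log probabilities of the sampled actions under the policy that collected the data.
    old_log_probs = tf.placeholder(...)
    # Log probabilities of the same actions under the policy being optimized.
    log_probs = tf.log(ch_action_p_values)
    # Probability ratio pi_new(a|s) / pi_old(a|s), one value per sample.
    prob_ratio = tf.exp(log_probs - old_log_probs)
    clip_prob = tf.clip_by_value(prob_ratio, 1.-e_clip, 1.+e_clip)
    # Clipped surrogate objective, negated because the optimizer minimizes.
    ppo_loss = -tf.reduce_mean(tf.minimum(tf.multiply(prob_ratio, advantages), tf.multiply(clip_prob, advantages)))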

Besides the usual advantages and ch_action_p_values, you need to feed the loss with old_log_probs: the log probabilities of the sampled actions under the policy that collected them, evaluated before the update and fed as constants while you train on that batch.
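To see what the clipping does numerically, here is a small sketch of the same loss in plain Python, without TensorFlow. The function name ppo_clip_loss and the example values are invented for illustration. It mirrors the graph code above under the assumption that all three inputs hold one value per sample.

```python
import math

def ppo_clip_loss(log_probs, old_log_probs, advantages, e_clip=0.2):
    # Clipped surrogate loss, averaged over the batch and negated so that
    # minimizing it maximizes the surrogate objective.
    terms = []
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - e_clip), 1.0 + e_clip)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

# Same policy for old and new: every ratio is 1, so the loss is -mean(advantages).
same = ppo_clip_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -0.5])

# Ratio 1.5 with a positive advantage: the gain is capped at 1 + e_clip.
capped = ppo_clip_loss([math.log(0.6)], [math.log(0.4)], [1.0])

# Ratio 1.5 with a negative advantage: the unclipped, more pessimistic term is kept.
uncapped = ppo_clip_loss([math.log(0.6)], [math.log(0.4)], [-1.0])
```

On the first pass over a freshly collected batch the old and new log probabilities are identical, so the ratio is 1 everywhere. The clipping only starts to matter after a few gradient steps on the same batch.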