reinforcement-learning

Objective function in proximal policy optimization


In PPO's objective function, the second term introduces the squared-error loss of the value function neural network. Is that term essentially the squared advantage values?
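
For reference, the combined objective from the PPO paper (Schulman et al., 2017) is

$$L_t^{\mathrm{CLIP+VF+S}}(\theta) = \hat{\mathbb{E}}_t\!\left[ L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) \right], \qquad L_t^{\mathrm{VF}}(\theta) = \left(V_\theta(s_t) - V_t^{\mathrm{targ}}\right)^2.$$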


Solution

  • No. That term is the squared-error loss used to train the value network V: the prediction V(s_t) is regressed toward a return or TD target, so it is the squared TD/return error, not the squared advantage. (The advantage estimates appear in the first term, the clipped surrogate objective.) If the policy and value networks share no parameters, you can split the combined objective into two separate losses and nothing changes: in practice, the policy is trained on the first term and V on the second, as sketched below.
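
A minimal sketch of that separation, assuming PyTorch and hypothetical pre-computed tensors (`new_log_probs`, `old_log_probs`, `advantages`, `values`, `value_targets`):

```python
import torch
import torch.nn.functional as F

def ppo_losses(new_log_probs, old_log_probs, advantages,
               values, value_targets, clip_eps=0.2):
    """Compute PPO's clipped surrogate (policy) loss and the
    squared-error value loss as two independent terms."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective, negated because optimizers minimize
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss: squared error against a return/TD target,
    # NOT the squared advantage
    value_loss = F.mse_loss(values, value_targets)

    return policy_loss, value_loss
```

With separate networks, each loss can be backpropagated through its own network independently; with a shared trunk you would instead minimize `policy_loss + c1 * value_loss` in a single backward pass, which is when the combined objective actually matters.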