reinforcement-learning

Objective function in proximal policy optimization


In PPO's objective function, the second term introduces the squared-error loss of the value function neural network. Is that term essentially the squared advantage values?
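
For reference, the combined objective from the PPO paper (Schulman et al., 2017) is

$$L_t^{\mathrm{CLIP+VF+S}}(\theta) = \hat{\mathbb{E}}_t\!\left[ L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) \right], \qquad L_t^{\mathrm{VF}}(\theta) = \left(V_\theta(s_t) - V_t^{\mathrm{targ}}\right)^2.$$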


Solution

  • No. That term is the squared-error loss used to train the value network V: the prediction V(s_t) is regressed toward a return or TD target, so it is the squared TD/return error, not the squared advantage. (The advantage estimates appear in the first term, the clipped surrogate objective.) If the policy and value networks share no parameters, you can split the combined objective into two separate losses and nothing changes: in practice, the policy is trained on the first term and V on the second, as sketched below.
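
A minimal sketch of that separation, assuming PyTorch and hypothetical pre-computed tensors (`new_log_probs`, `old_log_probs`, `advantages`, `values`, `value_targets`):

```python
import torch
import torch.nn.functional as F

def ppo_losses(new_log_probs, old_log_probs, advantages,
               values, value_targets, clip_eps=0.2):
    """Compute PPO's clipped surrogate (policy) loss and the
    squared-error value loss as two independent terms."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective, negated because optimizers minimize
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss: squared error against a return/TD target,
    # NOT the squared advantage
    value_loss = F.mse_loss(values, value_targets)

    return policy_loss, value_loss
```

With separate networks, each loss can be backpropagated through its own network independently; with a shared trunk you would instead minimize `policy_loss + c1 * value_loss` in a single backward pass, which is when the combined objective actually matters.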