tensorflow unity-game-engine machine-learning tensorboard

How to interpret "Value Loss" chart in TensorBoard?

I have a target-finding, obstacle-avoiding helicopter in Unity Machine Learning Agents. Looking at the TensorBoard for my training, I'm trying to get a feel for how to interpret the "Losses/Value Loss".

I've googled many articles on ML Loss, like this one, but I can't seem to get an intuitive understanding yet of what it all means for my little helicopter and possible changes I should implement, if any. (The helicopter is rewarded by getting closer and again for reaching the target, and punished by getting further or colliding. It measures a variety of things like relative speed, relative target position, ray sensors and so on, and it does basically work in target-finding, whereas more complicated maze type obstacles have not been tested or trained on yet. It's using 3 layers.) Thanks!

Solution

In reinforcement learning and specifically regarding actor/critic algorithms, value loss is the difference (or an average of many such differences) between the learning algorithm's expectation of a state's value and the empirically observed value of that state.

What is a state's value? A state's value is, in short, how much reward you can expect given that you start in that state. Immediate reward contributes completely to this amount. Reward that can possibly occur but not immediately contribute less, with more distant occurrences contributing less and less. We call this reduction in contribution to value a "discount", or we say that these rewards are "discounted".

Expected value is how much the critic part of the algorithm predicts the value to be. In the case of a critic implemented as a neural network, it's the output of the neural network with the state as its input.

Empirically observed value is the amount you get when you add up the rewards that you actually got when you left that state, plus any rewards (discounted by some amount) you got immediately after that for some number of steps (we'll say after these steps you ended up on state X), and (perhaps, depending on implementation) plus some discounted amount based on the value of state X.

In short, the smaller it is, the better it got at predicting how well it is going to perform. This doesn't mean that it gets better at playing - after all, one can be terrible at a game yet be accurate at predicting that they will lose and when they will lose if they learn to choose actions that will make them lose quickly!