When I am training my model I have the following segment:
s_t_batch, a_batch, y_batch = train_data(minibatch, model2)
# perform gradient step
loss.append(model.train_on_batch([s_t_batch, a_batch], y_batch))
where s_t_batch and a_batch correspond to the current states and the actions taken in those states, respectively. model2 is the same as model, except that model2 has an output of size num_actions, while model only outputs the value of the action that was taken in that state.
What I find strange (and it is really the focus of this question) is that in the function train_data I have the line:
y_batch = r_batch + GAMMA * np.max(model.predict(s_t_batch), axis=1)
The strange part is that I am using the model to generate my y_batch as well as training on it. Doesn't this become some sort of self-fulfilling prophecy? If I understand correctly, the model tries to predict the expected maximum reward. Using the same model to generate y_batch implies that it is the true model, doesn't it?
My questions are:
1. What is the intuition behind using the same model to generate y_batch as the one that is trained on it?
2. (Optional) Does the loss value mean anything? When I plot it, it doesn't seem to be converging; however, the sum of rewards does seem to be increasing (see the plots in the link below).
The full code can be found here, which is an implementation of Deep Q Learning on the CartPole-v0 problem:
The fact that the model trains on its own predictions is the whole point of Q-learning: it is a concept called bootstrapping, which means reusing your experience. The insight behind this is:
As the model has both its own Q-value prediction for time t (= [s_t_batch, a_batch]) and its (discounted) approximation for state t+1 plus the reward (= y_batch), it is able to measure how wrong its prediction for Q_t is.

Your loss means exactly this: for one batch, it is the mean-squared error between your model's direct Q-value prediction for time t and its bootstrapped prediction for the same time t, i.e. its Q-value approximation for the next state combined with some "ground truth" from the environment, namely the reward for this timestep.
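To make that concrete, here is a minimal sketch of the target and the per-transition squared error described above (the function name and arguments are mine, not from the linked code; GAMMA is assumed to be the same discount factor used in train_data):

import numpy as np

GAMMA = 0.99  # discount factor, assumed to match the one used in train_data

def td_target_and_loss(r, q_next, q_taken):
    # r       : reward received at time t
    # q_next  : the model's predicted action values for the next state
    # q_taken : the model's predicted value of the action actually taken at time t
    y = r + GAMMA * np.max(q_next)    # same form as y_batch above
    return y, (y - q_taken) ** 2      # squared error whose batch mean is the loss described above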
Your loss does seem to go down, but it is very unstable, which is a known issue of vanilla Q-Learning and especially of vanilla Deep Q-Learning. Look at the overview paper below to get an idea of how more complex algorithms work.
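As one concrete example of such a stabilisation (not part of the linked code, just an illustrative sketch assuming a Keras 2-style model like the one in the question): a separate "target" network, i.e. a frozen copy of the online model refreshed only every few hundred steps, is used to compute y_batch, which loosens the feedback loop between the predictions being trained and the targets they are trained towards:

import numpy as np
from tensorflow import keras

def make_target_model(model):
    # frozen copy of the online model, used only for computing targets
    target = keras.models.clone_model(model)
    target.set_weights(model.get_weights())
    return target

def td_targets(r_batch, s_t1_batch, target_model, gamma=0.99):
    # y_batch computed from the frozen copy instead of the online model
    return r_batch + gamma * np.max(target_model.predict(s_t1_batch), axis=1)

# every N training steps, refresh the copy:
# target_model.set_weights(model.get_weights())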
I advise you to look into Temporal Difference Learning. Good resources are also: