machine-learning · deep-learning · reinforcement-learning

Setting up target values for Deep Q-Learning


For standard Q-Learning combined with a neural network, things are more or less easy. One stores (s, a, r, s') during interaction with the environment and uses

target = Qnew(s,a) = (1 - alpha) * Qold(s,a) + alpha * ( r + gamma * max_{a'} Qold(s', a') )

as the target value for the neural network approximating the Q-function. So the input of the ANN is (s, a) and the output is the scalar Qnew(s, a). Deep Q-Learning papers/tutorials change the structure of the Q-function: instead of providing a single Q-value for the pair (s, a), the network now outputs the Q-values of all possible actions for a state s, so it is Q(s) instead of Q(s, a).
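For concreteness, here is a minimal Python sketch of that per-(s, a) target, assuming a callable q_old(s, a) that wraps the old network's prediction and a finite list of actions (both names are illustrative, not from the text above):

    # Sketch of the update above: mix the old estimate with the
    # bootstrapped estimate r + gamma * max_{a'} Qold(s', a').
    def td_target(q_old, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        # Best attainable value from the next state, under the old Q-function.
        best_next = max(q_old(s_next, a2) for a2 in actions)
        # Soft step from the old value toward the bootstrapped estimate.
        return (1 - alpha) * q_old(s, a) + alpha * (r + gamma * best_next)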

Here comes my problem. The database filled with (s, a, r, s') does not, for a specific state s, contain the reward for all actions, only for some, maybe just one action. So how do I set up the target values for the network Q(s) = [Q(a_1), ..., Q(a_n)] without having all rewards for the state s in the database? I have seen different loss functions/target values, but all of them contain the reward.

As you can see, I am puzzled. Can someone help me? There are a lot of tutorials on the web, but this step is generally poorly described and even less well motivated from the theory...


Solution

  • You only need the target value for the action that actually occurs in the observation (s, a, r, s'). To build it, you evaluate the network for all actions of the next state s' and take the maximum, as you wrote yourself: max_{a'} Qold(s', a'). This is added to r(s, a), and the result is the target value. For example, assume you have 10 actions and the observation is (s_0, a=5, r(s_0, a=5)=123, s_1). Then the target value is r(s_0, a=5) + \gamma * \max_{a'} Q_target(s_1, a'). With TensorFlow it could be something like:

    Q_Action = tf.reduce_sum(tf.multiply(Q_values, tf.one_hot(action, output_dim)), axis=1)  # shape: [batchSize]

    in which Q_values has shape [batchSize, output_dim]. So the output is a vector of size batchSize, and the target values form a second vector of the same size. The loss is the squared difference between them.

    When you calculate the loss, the backward pass likewise runs only through the action that exists in the observation; the gradient contributed by every other action is just zero, because the one-hot mask removes those entries. So you only need the reward of the action that was actually taken. A fuller sketch putting these pieces together follows below.
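    Putting the pieces together, here is a hedged end-to-end sketch in TF2 style; the names q_net, q_target_net, and the done flag are illustrative assumptions, not part of the original answer:

    import tensorflow as tf

    # Sketch only: q_net is the trained network, q_target_net a frozen copy,
    # `done` a float flag (1.0 at episode end); all names are illustrative.
    def dqn_loss(q_net, q_target_net, s, a, r, s_next, done, gamma=0.99):
        q_values = q_net(s)                                   # [batchSize, output_dim]
        n_actions = q_values.shape[-1]
        # Q-value of the taken action only, via the one-hot mask shown above.
        q_action = tf.reduce_sum(q_values * tf.one_hot(a, n_actions), axis=1)
        # Bootstrapped target; stop_gradient keeps it fixed during backprop.
        q_next = tf.reduce_max(q_target_net(s_next), axis=1)  # [batchSize]
        target = r + gamma * (1.0 - done) * q_next            # no bootstrap past the end
        # Squared error on the taken action only; all other actions receive
        # zero gradient because the mask removed them from q_action.
        return tf.reduce_mean(tf.square(tf.stop_gradient(target) - q_action))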