Tags: neural-network, reinforcement-learning, q-learning

Calculating Q value in dqn with experience replay


Consider the Deep Q-Learning algorithm:

1   initialize replay memory D
2   initialize action-value function Q with random weights
3   observe initial state s
4   repeat
5       select an action a
6           with probability ε select a random action
7           otherwise select a = argmax_a’ Q(s, a’)
8       carry out action a
9       observe reward r and new state s’
10      store experience <s, a, r, s’> in replay memory D
11
12      sample random transitions <ss, aa, rr, ss’> from replay memory D
13      calculate target for each minibatch transition
14          if ss’ is terminal state then tt = rr
15          otherwise tt = rr + γ max_aa’ Q(ss’, aa’)
16      train the Q network using (tt - Q(ss, aa))^2 as loss
17
18      s = s'
19  until terminated

In step 16 the value of Q(ss, aa) is used to calculate the loss. When is this Q-value calculated? At the time the action was taken, or during the training itself?

Since the replay memory only stores <s, a, r, s'> and not the Q-value, is it safe to assume the Q-value will be calculated at training time?


Solution

  • Yes. In step 16, when training the network, you use the loss (tt - Q(ss, aa))^2 because you want to update the network weights so that they approximate the most recent Q-values, with tt = rr + γ max_aa’ Q(ss’, aa’) used as the target. Q(ss, aa) is therefore the current estimate, computed at training time with the network's current weights, not a value saved when the action was taken (see the sketch below).

    Here you can find a Jupyter Notebook with a simple Deep Q-learning implementation that may be helpful.
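
    As a concrete illustration of steps 12-16, here is a minimal training-step sketch in PyTorch (an assumption; the original post names no framework). The names q_net, replay, optimizer, gamma, and the extra done flag marking terminal states are illustrative, not from the post. It makes the timing explicit: the target tt comes from the stored transition plus the current network's estimate at ss', while Q(ss, aa) is recomputed at training time rather than read from the replay memory.

    import random
    import numpy as np
    import torch
    import torch.nn as nn

    def train_step(q_net, optimizer, replay, batch_size=32, gamma=0.99):
        # 12: sample random transitions <ss, aa, rr, ss', done> from replay memory D
        # (done is an added flag indicating whether ss' is terminal)
        batch = random.sample(replay, batch_size)
        ss, aa, rr, ss2, done = zip(*batch)
        ss   = torch.as_tensor(np.asarray(ss),  dtype=torch.float32)
        aa   = torch.as_tensor(aa,              dtype=torch.int64)
        rr   = torch.as_tensor(rr,              dtype=torch.float32)
        ss2  = torch.as_tensor(np.asarray(ss2), dtype=torch.float32)
        done = torch.as_tensor(done,            dtype=torch.float32)

        # 13-15: targets; if ss' is terminal then tt = rr,
        # otherwise tt = rr + gamma * max_aa' Q(ss', aa')
        with torch.no_grad():
            tt = rr + gamma * (1.0 - done) * q_net(ss2).max(dim=1).values

        # 16: Q(ss, aa) is computed here, with the current weights, not at the
        # time the action was taken; the loss is (tt - Q(ss, aa))^2
        q_sa = q_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, tt)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()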