reinforcement-learning dqn

DOUBLE DQN doesn't make any sense


Why use two networks, training one every episode and updating the target network every N episodes, when we could use a single network and train it ONCE every N episodes? There is literally no difference!


Solution

  • What you are describing is not Double DQN. The periodically updated target network is a core feature of the original DQN algorithm (and all of its derivatives). DeepMind's classic paper explains why it is crucial to have two networks:

    The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q̂ and use Q̂ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
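
To make the distinction concrete, here is a minimal NumPy sketch of the two ideas being mixed up in the question. It is only an illustration under stated assumptions: the function names (`sync_target`, `dqn_targets`, `double_dqn_targets`), the dictionary of parameters, and the toy Q-value arrays are all hypothetical and do not come from the paper or from any library. The first function shows the periodic cloning described in the quote (part of plain DQN). The other two show how the targets y_j are usually computed in DQN versus Double DQN, where the online network picks the next action and the target network evaluates it.

```python
import numpy as np


def sync_target(step, online_params, target_params, C):
    """Clone the online parameters into the target network every C updates.

    Between syncs the target parameters stay frozen, which is the delay
    the quoted passage describes. (Hypothetical helper, for illustration.)
    """
    if step % C == 0:
        return {name: value.copy() for name, value in online_params.items()}
    return target_params


def dqn_targets(rewards, next_q_target, dones, gamma):
    """Plain DQN: the target network both selects and evaluates the next action."""
    max_next = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next


def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma):
    """Double DQN: the online network selects, the target network evaluates."""
    best_actions = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * evaluated


# Toy batch of two transitions with two actions each (made-up numbers).
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])  # the second transition ends the episode
next_q_online = np.array([[0.5, 2.0], [1.0, 0.0]])
next_q_target = np.array([[3.0, 1.0], [0.2, 0.4]])

y_dqn = dqn_targets(rewards, next_q_target, dones, gamma=0.9)
y_ddqn = double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.9)

# Periodic cloning: the target only changes on steps that are multiples of C.
online = {"w": np.array([1.0, 2.0])}
target = {"w": np.zeros(2)}
target = sync_target(step=4, online_params=online, target_params=target, C=4)
online["w"] += 1.0  # an online update that the frozen target does not see
target = sync_target(step=5, online_params=online, target_params=target, C=4)
```

In this toy batch the two target rules disagree on the first transition: plain DQN takes the target network's own maximum, while Double DQN evaluates the action the online network prefers. Both variants still rely on a separate, periodically synced target network, so training a single network once every N episodes reproduces neither of them.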