Why use 2 networks, training once every episode and updating the target network every N episodes, when we can use 1 network and train it ONCE every N episodes? There is literally no difference!
What you are describing is not Double DQN. The periodically updated target network is a core feature of the original DQN algorithm (and all of its derivatives). DeepMind's classic paper explains why it is crucial to have two networks:
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q^ and use Q^ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
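To make the mechanism concrete, here is a minimal PyTorch-style sketch of that passage (not DeepMind's code): the names QNetwork, train_step, C, and the hyperparameters are all hypothetical, and the network architecture is just a placeholder. The point it illustrates is that Q is updated on every training step, while the targets y_j are computed from the older parameters Q^, which are only refreshed by cloning Q every C updates.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical Q-network; any architecture mapping states to per-action values would do.
class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)

q_net = QNetwork(state_dim=4, n_actions=2)      # Q: updated every training step
target_net = copy.deepcopy(q_net)               # Q^: frozen clone of Q
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
C = 1000      # clone Q into Q^ every C updates (assumed value)
gamma = 0.99  # discount factor (assumed value)

def train_step(step, batch):
    states, actions, rewards, next_states, dones = batch

    # Targets y_j come from the *older* parameters Q^, not from Q itself.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Current estimates Q(s_j, a_j) come from the online network Q.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = nn.functional.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every C updates, clone Q to obtain the new target network Q^.
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```

Note the contrast with training one network once every N updates: here the online network Q keeps learning on every step, while the targets stay fixed for C steps. That delay between updating Q and that update affecting the targets is exactly the stabilizing effect the paper describes.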