Tags: reinforcement-learning, dqn

Why does the Deep Q-Network (DQN) algorithm perform only one gradient descent step?


Why does the DQN algorithm perform only one gradient descent step per update, i.e. train for only one epoch? Wouldn't it benefit from more epochs? Wouldn't its accuracy improve with more training?


Solution

  • Time efficiency.

    In theory, in the policy iteration / evaluation scheme, you should wait for the evaluation step to converge before moving to the next update. In practice, however, this can (a) never happen, or (b) take too long. So people typically take one single gradient step with a small learning rate, in the hope that the critic (Q) is not "too wrong".

    You could try more steps, but in general the number of gradient steps per update is a design choice (a hyperparameter), and the DQN authors presumably found that one step worked best for them.
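To make the answer concrete, here is a minimal sketch of the "one gradient step per update" idea, using a hypothetical linear Q-function `Q(s, a) = w[a] @ s` instead of a deep network (real DQN also uses a target network and experience replay, which are omitted here). The point is that `dqn_update` takes exactly one semi-gradient step on a sampled batch rather than looping until the TD error converges:

```python
import numpy as np

def dqn_update(w, batch, gamma=0.99, lr=1e-3):
    """One semi-gradient descent step on a linear Q-function Q(s, a) = w[a] @ s.

    w     : (num_actions, state_dim) weight matrix (hypothetical toy Q-network)
    batch : list of (state, action, reward, next_state, done) transitions
    Returns the updated weights after a SINGLE gradient step -- no inner loop.
    """
    grad = np.zeros_like(w)
    for s, a, r, s_next, done in batch:
        q_sa = w[a] @ s
        # Bootstrapped TD target; treated as a constant (semi-gradient).
        target = r if done else r + gamma * max(w[b] @ s_next for b in range(len(w)))
        # Gradient of 0.5 * (target - Q(s, a))^2 with respect to w[a].
        grad[a] += -(target - q_sa) * s
    # One step only: in the hope that Q is "not too wrong" after each update.
    return w - lr * grad / len(batch)
```

Repeating the evaluation step to convergence would just mean calling `dqn_update` in an inner loop on the same batch before collecting new data; DQN instead interleaves one such step with environment interaction.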