deep-learning, reinforcement-learning, nonlinear-functions, dqn

Why does randomizing samples in a reinforcement learning model with a non-linear function approximator reduce variance?


I have read the DQN paper.

While reading it, I found that randomly selecting samples for learning reduces divergence in RL with a non-linear function approximator.

If so, why does RL with a non-linear function approximator diverge when the input data are strongly correlated?


Solution

  • I believe that Section X (starting on page 687) of An Analysis of Temporal-Difference Learning with Function Approximation provides an answer to your question. In summary, there exist nonlinear functions whose average prediction error actually increases after applying the TD(0) Bellman operator; hence the value estimates, and with them the policy, eventually diverge (the update in question is written out below). This is generally the case for deep neural networks because they are inherently nonlinear and tend to be poorly behaved from an optimization perspective.

    From another angle, training on independent and identically distributed (i.i.d.) data makes it possible to compute unbiased estimates of the gradient, which is required for first-order optimization algorithms like Stochastic Gradient Descent (SGD) to converge to a local minimum of the loss function. This is why DQN samples random minibatches from a large replay memory and then minimizes the loss with RMSProp (a variant of SGD), as sketched in the replay-buffer example below.
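
    For reference, the TD(0) semi-gradient update that the cited analysis studies can be written as follows (a standard formulation, not quoted from the paper; $\theta$ are the approximator's parameters, $\alpha$ is the step size, and $V_\theta$ is the nonlinear value function):

    $$\theta \leftarrow \theta + \alpha\,\big(r_t + \gamma\,V_\theta(s_{t+1}) - V_\theta(s_t)\big)\,\nabla_\theta V_\theta(s_t)$$

    Roughly speaking, when $V_\theta$ is linear and states are sampled on-policy, this update converges; the counterexample in the cited section constructs a nonlinear $V_\theta$ for which the expected update instead increases the prediction error, which is the divergence referred to above.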
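
    To make the second point concrete, here is a minimal sketch of DQN-style experience replay in PyTorch. It assumes a small fully connected Q-network in place of DQN's convolutional network, and the hyperparameters and helper names (`store`, `train_step`) are illustrative rather than taken from the paper. The key lines are the uniform random minibatch sampling, which breaks the temporal correlation between consecutive transitions, and the RMSProp step on the resulting loss.

    ```python
    import random
    from collections import deque

    import torch
    import torch.nn as nn

    # Illustrative sizes and hyperparameters (not the paper's values).
    STATE_DIM, N_ACTIONS = 4, 2
    GAMMA, BATCH_SIZE, MEMORY_SIZE = 0.99, 32, 100_000

    # Large FIFO replay memory of (s, a, r, s_next, done) transitions.
    replay_memory = deque(maxlen=MEMORY_SIZE)

    # Small nonlinear Q-network standing in for DQN's convnet,
    # plus a periodically synced target network.
    q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
    target_net.load_state_dict(q_net.state_dict())

    optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
    loss_fn = nn.SmoothL1Loss()  # Huber loss


    def store(s, a, r, s_next, done):
        """Append one transition; consecutive transitions are strongly correlated."""
        replay_memory.append((s, a, r, s_next, done))


    def train_step():
        """Sample a random (approximately i.i.d.) minibatch and take one RMSProp step."""
        if len(replay_memory) < BATCH_SIZE:
            return
        # Uniform random sampling breaks the temporal correlation of the experience stream.
        batch = random.sample(replay_memory, BATCH_SIZE)
        s, a, r, s_next, done = (torch.as_tensor(x, dtype=torch.float32)
                                 for x in zip(*batch))

        # Q(s, a) for the actions that were actually taken.
        q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

        # Q-learning (TD) target: r + gamma * max_a' Q_target(s', a'); zero beyond terminals.
        with torch.no_grad():
            target = r + GAMMA * (1.0 - done) * target_net(s_next).max(dim=1).values

        loss = loss_fn(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```

    In the agent's interaction loop one would call store(...) after every environment step, call train_step() every step or every few steps, and periodically copy q_net's weights into target_net, which is the training scheme DQN uses.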