I am trying to train a DQN Agent to solve AI Gym's Cartpole-v0 environment. I have started with this person's implementation just to get some hands-on experience. What I noticed is that during training, after many episodes the agent finds the solution and is able to keep the pole upright for the maximum amount of timesteps. However, after further training, the policy looks like it becomes more stochastic and it can't keep the pole upright anymore and goes in and out of a good policy. I'm pretty confused by this why wouldn't further training and experience help the agent? At episodes my epsilon for random action becomes very low, so it should be operating on just making the next prediction. So why does it on some training episodes fail to keep the pole upright and on others it succeeds?
Here is a picture of my reward-episode curve during the training process of the above linked implementation.
This actually looks fairly normal to me, in fact I guessed your results were from CartPole before reading the whole question.
I have a few suggestions: