I am using a deep reinforcement learning approach to predict time-series behavior. I am quite a newbie at this, so my question is more conceptual than programming-related. My colleague gave me the following chart, showing the training, validation, and testing accuracy of time-series classification using deep reinforcement learning.
From this chart, you can see that both the validation and testing accuracies are at chance level, so the agent is clearly overfitting.
What surprises me more (maybe because of my lack of knowledge, which is why I am asking here) is how my colleague trains his agent. The x-axis of the chart shows the "epoch" number (or iteration). In other words, the agent is fitted (trained) several times, as in the code below:
from math import floor
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent

# Instantiate the agent (once, outside the training loop)
self.agent = DQNAgent(model=self.model, policy=self.policy,
                      nb_actions=self.nbActions, memory=self.memory,
                      nb_steps_warmup=200, target_model_update=1e-1,
                      enable_double_dqn=True, enable_dueling_network=True)

# Compile the agent with the Adam optimizer and the mean absolute error metric
self.agent.compile(Adam(lr=1e-3), metrics=['mae'])

# There will be 100 iterations; I will fit and test the agent 100 times
for i in range(0, 100):
    # Delete the previous environments and create new ones
    del trainEnv
    trainEnv = SpEnv(parameters)
    del validEnv
    validEnv = SpEnv(parameters)
    del testEnv
    testEnv = SpEnv(parameters)

    # Reset the callbacks used to show the metrics while training, validating and testing
    self.trainer.reset()
    self.validator.reset()
    self.tester.reset()

    #### TRAINING STEP ####
    # Reset the training environment
    trainEnv.resetEnv()
    # Train the agent
    self.agent.fit(trainEnv, nb_steps=floor(self.trainSize.days - self.trainSize.days * 0.2),
                   visualize=False, verbose=0)
    # Get the metrics from the training callback
    metrics = self.trainer.getInfo()

    #### VALIDATION STEP ####
    # Reset the validation environment
    validEnv.resetEnv()
    # Test the agent on the validation data
    self.agent.test(validEnv, other_parameters)
    # Get the metrics from the validation callback
    metrics = self.validator.getInfo()

    #### TEST STEP ####
    # Reset the testing environment
    testEnv.resetEnv()
    # Test the agent on the testing data
    self.agent.test(testEnv, nb_episodes=floor(self.validationSize.days - self.validationSize.days * 0.2),
                    visualize=False, verbose=0)
    # Get the metrics from the testing callback
    metrics = self.tester.getInfo()
What is strange to me, given the chart and the code, is that the agent is fitted several times, apparently independently of one another, yet the training accuracy increases over time. It seems that previous experience is helping the agent improve its training accuracy. But how can that be possible if the environments are reset and the agent is simply fitted again? Is there any backpropagation of error from previous fittings that helps the agent increase its accuracy in later fittings?
What is reset is the environment, not the agent. The agent is created (and compiled) once, before the loop, so its network weights and its replay memory persist across iterations: every call to fit() continues training from wherever the previous call left off. In other words, the agent accumulates experience from every iteration, which is why the training accuracy keeps rising.
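Here is a minimal, self-contained sketch of the same pattern that you can run to convince yourself the weights carry over between fit() calls. It uses gym's CartPole in place of SpEnv (whose code is not shown), and the model/hyperparameters are illustrative, not taken from your colleague's setup:

import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make('CartPole-v0')
nb_actions = env.action_space.n

model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16, activation='relu'))
model.add(Dense(nb_actions, activation='linear'))

# The agent is created once, outside the loop, just as in your code
agent = DQNAgent(model=model, policy=EpsGreedyQPolicy(), nb_actions=nb_actions,
                 memory=SequentialMemory(limit=50000, window_length=1),
                 nb_steps_warmup=100, target_model_update=1e-1)
agent.compile(Adam(lr=1e-3), metrics=['mae'])

weights_before = [w.copy() for w in model.get_weights()]
for i in range(3):
    del env
    env = gym.make('CartPole-v0')  # a fresh environment every iteration...
    agent.fit(env, nb_steps=500, visualize=False, verbose=0)
# ...but the network weights (and the replay memory) carried over:
changed = any(not np.allclose(a, b)
              for a, b in zip(weights_before, model.get_weights()))
print(changed)  # True: fit() kept training the same network

If your colleague actually wants 100 independent runs, the agent (and its compile() call) must be re-created inside the loop, so that the weights are re-initialized and the replay memory is emptied at every iteration.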