I am using a deep reinforcement learning approach to predict time-series behavior. I am quite a newbie at this, so my question is more conceptual than programming-related. My colleague gave me the following chart, showing the training, validation, and testing accuracy of time-series classification using deep reinforcement learning.
From this chart, you can see that both the validation and testing accuracies are at chance level, so the agent is clearly overfitting.
What surprises me more (maybe because of my lack of knowledge, which is why I am asking here) is how my colleague trains his agent. The x-axis of the chart shows the "epoch" number (or iteration). In other words, the agent is fitted (trained) several times, as in the code below:
from math import floor
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent

# Instantiate the agent (once, outside the training loop)
self.agent = DQNAgent(model=self.model, policy=self.policy,
                      nb_actions=self.nbActions, memory=self.memory,
                      nb_steps_warmup=200, target_model_update=1e-1,
                      enable_double_dqn=True, enable_dueling_network=True)

# Compile the agent with the Adam optimizer and the mean absolute error metric
self.agent.compile(Adam(lr=1e-3), metrics=['mae'])

# There will be 100 iterations; I will fit and test the agent 100 times
for i in range(0, 100):
    # Delete the previous environments and create new ones
    del trainEnv
    trainEnv = SpEnv(parameters)
    del validEnv
    validEnv = SpEnv(parameters)
    del testEnv
    testEnv = SpEnv(parameters)

    # Reset the callbacks used to show the metrics while training, validating and testing
    self.trainer.reset()
    self.validator.reset()
    self.tester.reset()

    #### TRAINING STEP ####
    # Reset the training environment
    trainEnv.resetEnv()
    # Train the agent
    self.agent.fit(trainEnv, nb_steps=floor(self.trainSize.days - self.trainSize.days * 0.2),
                   visualize=False, verbose=0)
    # Get the metrics from the training callback
    metrics = self.trainer.getInfo()

    #### VALIDATION STEP ####
    # Reset the validation environment
    validEnv.resetEnv()
    # Test the agent on the validation data
    self.agent.test(validEnv, other_parameters)
    # Get the metrics from the validation callback
    metrics = self.validator.getInfo()

    #### TEST STEP ####
    # Reset the testing environment
    testEnv.resetEnv()
    # Test the agent on the testing data
    self.agent.test(testEnv, nb_episodes=floor(self.validationSize.days - self.validationSize.days * 0.2),
                    visualize=False, verbose=0)
    # Get the metrics from the testing callback
    metrics = self.tester.getInfo()
What is strange to me, given the chart and the code, is that the agent is fitted several times, apparently independently of one another, yet the training accuracy increases over time. It seems that previous experience is helping the agent improve its training accuracy. But how can that be possible if the environments are reset and the agent is simply fitted again? Is there any backpropagation of error from previous fittings that helps the agent increase its accuracy in later fittings?
What is reset is the environment, not the agent. The agent is created (and compiled) once, before the loop, so its network weights and its replay memory persist across iterations: every call to fit() continues training from wherever the previous call left off. In other words, the agent accumulates experience from every iteration, which is why the training accuracy keeps rising.
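Here is a minimal, self-contained sketch of the same pattern that you can run to convince yourself the weights carry over between fit() calls. It uses gym's CartPole in place of SpEnv (whose code is not shown), and the model/hyperparameters are illustrative, not taken from your colleague's setup:

import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make('CartPole-v0')
nb_actions = env.action_space.n

model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16, activation='relu'))
model.add(Dense(nb_actions, activation='linear'))

# The agent is created once, outside the loop, just as in your code
agent = DQNAgent(model=model, policy=EpsGreedyQPolicy(), nb_actions=nb_actions,
                 memory=SequentialMemory(limit=50000, window_length=1),
                 nb_steps_warmup=100, target_model_update=1e-1)
agent.compile(Adam(lr=1e-3), metrics=['mae'])

weights_before = [w.copy() for w in model.get_weights()]
for i in range(3):
    del env
    env = gym.make('CartPole-v0')  # a fresh environment every iteration...
    agent.fit(env, nb_steps=500, visualize=False, verbose=0)
# ...but the network weights (and the replay memory) carried over:
changed = any(not np.allclose(a, b)
              for a, b in zip(weights_before, model.get_weights()))
print(changed)  # True: fit() kept training the same network

If your colleague actually wants 100 independent runs, the agent (and its compile() call) must be re-created inside the loop, so that the weights are re-initialized and the replay memory is emptied at every iteration.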