I'm using RL Coach through AWS SageMaker, and I'm running into an issue that I'm struggling to understand.
I'm performing RL using AWS SageMaker for the learning and AWS RoboMaker for the environment, much like DeepRacer, which also uses RL Coach. In fact, my code differs only slightly from the DeepRacer code on the learning side. The environment, however, is completely different.
What happens:
The agent raises an exception with the message: Failed to restore agent's checkpoint: 'main_level/agent/main/online/global_step'
The traceback points into this RL Coach module:
File "/someverylongpath/rl_coach/architectures/tensorflow_components/savers.py", line 93, in <dictcomp>
for ph, v in zip(self._variable_placeholders, self._variables)
KeyError: 'main_level/agent/main/online/global_step'
Just like DeepRacer, I apply a patch to RL Coach. One notable change in the patch is:
- self._variables = tf.global_variables()
+ self._variables = tf.trainable_variables()
But shouldn't that change result in 'main_level/agent/main/online/global_step' not being in self._variables?
I think the problem is that global_step is in self._variables when it should not be there.
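If that hypothesis is right, the failure mode can be modeled in plain Python. This is only an illustration of the zip/dict-comprehension pattern from savers.py, not RL Coach's actual code: plain strings stand in for TF placeholders and variables, and the variable names other than global_step are made up.

```python
# Variables the saver iterates over, as if collected with
# tf.global_variables() (which includes non-trainable variables
# such as global_step):
variables = [
    "main_level/agent/main/online/weights",       # illustrative name
    "main_level/agent/main/online/global_step",   # non-trainable
]
placeholders = [name + "_ph" for name in variables]

# Values restored from a checkpoint that was written using only
# tf.trainable_variables(), so global_step was never saved:
checkpoint_values = {
    "main_level/agent/main/online/weights": [0.1, 0.2],
}

try:
    # Mirrors the dict comprehension at savers.py line 93: every
    # variable in self._variables must have a value in the checkpoint.
    feed_dict = {
        ph: checkpoint_values[v]
        for ph, v in zip(placeholders, variables)
    }
except KeyError as e:
    print("KeyError:", e)  # KeyError: 'main_level/agent/main/online/global_step'
```

In other words, a mismatch between the collection used when saving and the collection used when restoring is enough to reproduce exactly this KeyError.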
So there are a few things I don't understand about this problem, and I'm not familiar with RL Coach, so any help would be valuable.
A bit more info:
rl-coach-slim 1.0.0
and tensorflow 1.11.0
Update: I removed the patch (technically, I removed the patch command in my Dockerfile that was applying it), and now it works; the model is correctly restored from the checkpoint.