I'm reading Reinforcement Learning by Sutton and Barto, and for an example of Dyna-Q, they use a maze problem. The example shows that with n=50 steps of planning, the algorithm reaches the optimal path in only 3 episodes.
Is this an improvement over 50-step Q-learning? It seems like you are really just running a bunch of 50-step Q-learning algorithms in each episode, so saying it finds the optimal path in 3 episodes is misleading.
Also, I guess the big question is, I thought Dyna-Q was useful when you don't have a model of the environment, but in this example don't we have a model of the environment? Why use all of the memory to save all our previous moves if we already have a model? I'm having trouble understanding why this is a good example for Dyna-Q.
In theory, we don't have the model. We have it in practice just for simulation, but in real life we don't.
Dyna-Q basically approximate your model using sample. Instead of learning the transition and the reward functions, you "query" your data: what happened in the past when I did action a
in state s
? If everything is deterministic, this is equivalent of knowing the exact model.
Think it also like this. In classic Q-learning your know only your current s,a
, so you update Q(s,a)
only when you visit it. In Dyna-Q, you update all Q(s,a)
every time you
query them from the memory. You don't have to revisit them. This speeds up things tremendously.
Also, the very common "replay memory" basically reinvented Dyna-Q, even though nobody acknowledges it.