I think I am messing something up.
I always thought that:
- 1-step TD on-policy = Sarsa
- 1-step TD off-policy = Q-learning
Thus I conclude:
- n-step TD on-policy = n-step Sarsa
- n-step TD off-policy = n-step Q-learning
In Sutton's book, however, he never introduces n-step Q-learning, but he does introduce off-policy n-step Sarsa. Now I'm confused.
Can someone help me with the naming?
Link to Sutton's book (Off-Policy n-step Sarsa on page 149)
> I always thought that:
> - 1-step TD on-policy = Sarsa
> - 1-step TD off-policy = Q-learning
That's mostly correct, but not the full story. Q-learning is a version of off-policy 1-step temporal-difference learning, but it is more specific than that: it updates Q-values for the policy that is greedy with respect to the current estimates. Off-policy value learning in general can learn about any target policy; Q-learning is the special case where the target policy is the greedy policy.
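To make the distinction concrete, here is a minimal sketch of the two 1-step updates (assuming a tabular Q stored as a NumPy array indexed by state and action; the function names are just for illustration). Sarsa bootstraps on the action actually taken next, while Q-learning bootstraps on the greedy action:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy 1-step TD: bootstrap on the action actually taken next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy 1-step TD with the greedy target policy: bootstrap on max_a Q."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# (Terminal-state handling, where the target is just r, is omitted for brevity.)
```

Note that when the next action happens to be the greedy one, the two updates coincide; they only differ when exploration makes the next action non-greedy.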
A naive extension of Q-learning to n steps would no longer be correct: simply plugging an n-step return sampled under the behaviour policy into the update doesn't work for off-policy algorithms (like Q-learning). You'd have to correct for the "off-policyness" in some way; one way to do that is importance sampling. When you introduce that correction in a general manner (for any possible target policy), you get the algorithm on the page you mentioned, which the book calls off-policy n-step Sarsa. A specific instance of this algorithm, with the target policy π being the greedy policy with respect to Q, can intuitively be understood as a "correct" version of n-step Q-learning.
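For concreteness, here is a rough sketch of that importance-sampling-corrected update, in the spirit of the off-policy n-step Sarsa update in Section 7.3 of the book; the episode layout and function names are assumptions made for this example, not the book's pseudocode verbatim:

```python
def off_policy_n_step_sarsa_update(Q, episode, t, n, alpha, gamma, target_pi, behavior_b):
    """One off-policy n-step Sarsa update of Q(S_t, A_t).

    episode: list of (state, action, reward) triples collected under the
             behaviour policy b, with episode[k] = (S_k, A_k, R_{k+1}).
    target_pi(a, s), behavior_b(a, s): action probabilities under pi and b.
    (The data layout and argument names are illustrative assumptions.)
    """
    T = len(episode)  # number of actions taken; S_T is terminal if the episode ended
    s_t, a_t, _ = episode[t]

    # Importance-sampling ratio over the actions A_{t+1}, ..., A_{t+n},
    # which were chosen by b but are evaluated under the target policy pi.
    rho = 1.0
    for k in range(t + 1, min(t + n, T - 1) + 1):
        s_k, a_k, _ = episode[k]
        rho *= target_pi(a_k, s_k) / behavior_b(a_k, s_k)

    # n-step return: discounted rewards plus a bootstrap from Q if the
    # episode has not terminated within the next n steps.
    G = sum(gamma ** (k - t) * episode[k][2] for k in range(t, min(t + n, T)))
    if t + n < T:
        s_n, a_n, _ = episode[t + n]
        G += gamma ** n * Q[s_n, a_n]

    Q[s_t, a_t] += alpha * rho * (G - Q[s_t, a_t])
```

With target_pi set to the greedy policy derived from Q, this update is one reasonable way to read "n-step Q-learning" out of the book's off-policy n-step Sarsa.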