I'm trying to implement eligibility traces (forward looking), whose pseudocode can be found in the following image
I'm uncertain what the "For all s, a" means (5th line from below). Where do they get that collection of s, a from?
If it's forward-looking, do you loop forward from the current state to observe s'?
Do you adjust every single e(s, a)?
It's unfortunate that they've reused the variables s and a in two different scopes here, but yes, you adjust all e(s,a) values, e.g.,
for every state s in your state space
    for every action a in your action space
        update Q(s,a)
        update e(s,a)
Note what's happening here. Inside the loop every e(s,a) is decayed, so the update a pair contributes to Q shrinks exponentially with the time since that pair was last visited. But right before you go into that loop, you increment the single e(s,a) corresponding to the state/action pair just visited. So that pair gets "reset" in a way: it doesn't get the exponentially smaller update, and on the next iteration its update will continue to be larger than those of all the pairs you haven't recently visited. Every time you visit a state/action pair, you're increasing the weight it contributes to the update of Q for a few iterations.
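To make the loop concrete, here is a minimal sketch of one update step. The image isn't reproduced here, so this assumes the pseudocode is tabular Sarsa(lambda) with accumulating traces; the function name `sarsa_lambda_step` and the values of `alpha`, `gamma` and `lam` are made up for illustration.

```python
import numpy as np

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8):
    """One step of tabular Sarsa(lambda), accumulating traces (assumed variant)."""
    # TD error for the transition that was just taken.
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    # Bump the trace of the single pair just visited.
    e[s, a] += 1.0
    # "For all s, a": every entry of both tables is touched.
    for si in range(Q.shape[0]):
        for ai in range(Q.shape[1]):
            Q[si, ai] += alpha * delta * e[si, ai]  # update scaled by the trace
            e[si, ai] *= gamma * lam                # decay the trace
    return delta

# Hypothetical toy problem: 3 states, 2 actions, two hand-picked transitions.
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))
e = np.zeros_like(Q)

sarsa_lambda_step(Q, e, s=0, a=1, r=0.0, s_next=1, a_next=0)
sarsa_lambda_step(Q, e, s=1, a=0, r=1.0, s_next=2, a_next=1)

print(Q)
print(e)
```

In this toy run the reward arrives on the second step. The pair just visited, (1, 0), gets the full update of 0.1, while the earlier pair (0, 1) still gets 0.072 because its trace has only decayed once. Pairs that were never visited have a trace of zero and stay unchanged. With NumPy the two nested loops can also be written as `Q += alpha * delta * e` followed by `e *= gamma * lam`.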