Tags: reinforcement-learning, sarsa

Understanding linear, gradient-descent Sarsa (based on Sutton & Barto)


I'm trying to implement linear gradient-descent Sarsa based on Sutton & Barto's Book, see the algorithm in the picture below.

However, I struggle to understand something in the algorithm:

  • Is the dimension of w and z independent of how many different actions can be taken? In the book they seem to have dimension equal to the number of features, which I would say is independent of the number of actions.
  • Is there a w and a z for each action? Also, I cannot see in the book that this should be the case.
  • If I am right in the two bullets above, then I cannot see how the index list F_a can depend on the actions, and therefore I cannot see how the action-value function q_a can depend on the actions (see the lines marked in yellow in the algorithm below). But the action-value function must depend on the actions, so there is something I am not getting...

I hope someone can help clarify this for me :)

[Image: Sarsa(λ) algorithm]


Solution

  • w is the weight vector for the function approximator. The function you are approximating is Q(s,a), the action-value function, which tells you the value of taking an action in a state. It's up to you to define the weights, but yes, you're right: you need to think about how to represent the actions in the weights. One way is to define a set of state features and then instantiate them once per action (multiple separate w vectors). For convenience, you can then concatenate these vectors into one big w, because only the chunk of the weight vector activated by a state-action pair's features gets updated. Having a disjoint set of state features per action means a lot of weights if the action space is large, though, so you might instead compress multiple actions into different scalar values of a single weight. If the true Q values are close between actions, you'll perform just as well, and you'll actually learn faster because there are fewer weights to optimize. The representation is flexible; it's up to you!

    I encourage you to look at the algorithm as written in the second edition of the book (drafts are available from the authors' sites). The notation is clearer. The algorithm you posted is actually a lambda return method, which you can read about in chapter 12 (z is an eligibility trace, it has the same dimension as w and isn't critical to the question you are asking). Episodic semi-gradient Sarsa, the same algorithm minus some bells and whistles, appears in section 10.1.
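To make the first bullet concrete, here is a minimal sketch of the "concatenated per-action weight vectors" representation with binary features. The feature and action counts, the function names, and the binary-feature assumption are all illustrative choices, not taken from the book:

```python
import numpy as np

# Hypothetical sizes for illustration: 8 binary state features, 3 actions.
NUM_FEATURES = 8
NUM_ACTIONS = 3

# One chunk of weights per action, concatenated into a single vector.
# The eligibility trace z has the same dimension as w.
w = np.zeros(NUM_FEATURES * NUM_ACTIONS)
z = np.zeros_like(w)

def active_indices(state_features, action):
    """F(s,a): indices of the weights activated by this state-action pair.
    Offsetting by the action selects that action's chunk of w, which is
    how Q(s,a) ends up depending on the action."""
    offset = action * NUM_FEATURES
    return [offset + i for i in state_features]

def q(state_features, action):
    """Q(s,a) is the sum of the active weights (binary features)."""
    return w[active_indices(state_features, action)].sum()

def sarsa_lambda_update(delta, indices, alpha=0.1, gamma=0.99, lam=0.9):
    """One semi-gradient Sarsa(lambda) step, given the TD error
    delta = r + gamma * Q(s', a') - Q(s, a)."""
    global w, z
    z *= gamma * lam        # decay all traces
    z[indices] += 1.0       # accumulate traces on the active features
    w += alpha * delta * z  # update every weight with a nonzero trace
```

Note that updating the weights for one action leaves the other actions' chunks untouched, which is why the concatenated layout behaves exactly like separate per-action weight vectors.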