I could not understand how to update Q values for tic tac toe game. I read all about that but I could not imagine how to do this. I read that Q value is updated end of the game, but I haven't understand that if there is Q value for each action ?
You have a Q
value for each state-action pair. You update one Q
value after every action you perform. More precisely, if applying action a1
from state s1
gets you into state s2
and brings you some reward r
, then you update Q(s1, a1)
as follows:
Q(s1, a1) = Q(s1, a1) + learning_rate * (r + discount_factor * max Q(s2, _) - Q(s1, a1))
In many games, such as tic-tac-toe you don't get rewards until the end of the game, that's why you have to run the algorithm through several episodes. That's how information about utility of final states is propagated to other states.