Search code examples
machine-learningartificial-intelligencetic-tac-toereinforcement-learningq-learning

Q Learning Algorithm for Tic Tac Toe


I could not understand how to update Q values for tic tac toe game. I read all about that but I could not imagine how to do this. I read that Q value is updated end of the game, but I haven't understand that if there is Q value for each action ?


Solution

  • You have a Q value for each state-action pair. You update one Q value after every action you perform. More precisely, if applying action a1 from state s1 gets you into state s2 and brings you some reward r, then you update Q(s1, a1) as follows:

    Q(s1, a1) = Q(s1, a1) + learning_rate * (r + discount_factor * max Q(s2, _) - Q(s1, a1))
    

    In many games, such as tic-tac-toe you don't get rewards until the end of the game, that's why you have to run the algorithm through several episodes. That's how information about utility of final states is propagated to other states.