machine-learning, pseudocode, agent, reinforcement-learning, q-learning

Is this a correct implementation of Q-Learning for Checkers?


I am trying to understand Q-Learning.


My current algorithm operates as follows:

1. A lookup table is maintained that maps each state to the immediate reward and utility of each available action.

2. At each state, check whether it is contained in the lookup table and initialise it if not (with a default utility of 0).

3. Choose an action to take ϵ-greedily, where 0 < ϵ < 1 is the probability of taking a random action:

    With probability 1 − ϵ: choose the state-action pair with the highest utility.
    With probability ϵ: choose a random move.
    ϵ decreases over time.

4. Update the current state's utility (sketched in Python below) using the standard Q-Learning rule:

    Q(s_t, a_t) += α [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
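
To make the steps concrete, here is a minimal Python sketch of how I picture steps 2–4 (this is not my actual implementation; `legal_moves` is a placeholder for my checkers move generator, and states are assumed to be hashable board encodings):

```python
import random
from collections import defaultdict

ALPHA = 0.1   # learning rate (α in the update rule above)
GAMMA = 0.9   # discount factor (γ in the update rule above)

# Step 2: lookup table mapping state -> action -> utility, defaulting to 0 for unseen pairs.
Q = defaultdict(lambda: defaultdict(float))

def choose_action(state, epsilon):
    """Step 3: ϵ-greedy action selection over the legal moves."""
    actions = legal_moves(state)                     # placeholder checkers move generator
    if random.random() < epsilon:
        return random.choice(actions)                # random move with probability ϵ
    return max(actions, key=lambda a: Q[state][a])   # highest-utility move with probability 1 − ϵ

def update(state, action, reward, next_state):
    """Step 4: the Q-Learning update."""
    best_next = max((Q[next_state][a] for a in legal_moves(next_state)), default=0.0)
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
```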

I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.

The results are very poor: even after a couple of hundred games, the Q-Learning agent is losing far more than it is winning. Furthermore, the change in win rate is almost non-existent, especially after the first couple of hundred games.

Am I missing something? I have implemented a couple of agents:

(Rote-Learning, TD(0), TD(Lambda), Q-Learning)

But they all seem to be yielding similar, disappointing, results.



Solution

  • There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will be a very, very long time until you get meaningful action values this way with a plain lookup table. Generally, you'd want function approximation over a simplified state representation, such as a neural network, to solve this kind of problem with reinforcement learning.

    Also, a couple of caveats:

    • Ideally, you should update 1 value per game, because the moves in a single game are highly correlated.
    • You should initialize action values to small random values to avoid large policy changes from small Q updates.
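
As a rough illustration of both points (not a drop-in solution), a linear approximation over hand-crafted board features replaces the lookup table and is initialised with small random weights; `board_features` and `legal_moves` here are assumed helpers, not part of the question's code:

```python
import numpy as np

N_FEATURES = 8   # e.g. men/kings counts and mobility for each side (assumed feature set)

# Small random initial weights, per the second caveat above.
w = np.random.uniform(-0.01, 0.01, N_FEATURES)

def q_value(state, action):
    """Approximate Q(s, a) as a dot product of the weights and the feature vector."""
    return float(np.dot(w, board_features(state, action)))   # board_features: assumed helper

def q_update(state, action, reward, next_state, alpha=0.01, gamma=0.99):
    """One Q-Learning step applied to the weights instead of a table entry."""
    best_next = max((q_value(next_state, a) for a in legal_moves(next_state)),
                    default=0.0)                              # 0 at terminal positions
    td_error = reward + gamma * best_next - q_value(state, action)
    w[:] += alpha * td_error * board_features(state, action)  # gradient step on the weights
```

With a reasonable feature set this generalises across the ~10²⁰ positions, whereas a lookup table only ever learns about states it has literally visited.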