Hi everyone, I am new to neural networks. I wrote the following simple architecture in Python. It takes 9 states as input, one per cell of the tic-tac-toe board, and should output 9 values, one per possible action; the highest value is the action of interest.
import tensorflow as tf
from tensorflow import keras

def agent(state_shape, action_shape):
    learning_rate = 0.001
    init = tf.keras.initializers.HeUniform()
    model = keras.Sequential()
    model.add(keras.layers.Dense(24, input_shape=state_shape, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(12, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(action_shape, activation='linear', kernel_initializer=init))
    model.compile(loss=tf.keras.losses.Huber(),
                  optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  metrics=['accuracy'])
    return model
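For the 3x3 board I would create the model with 9 inputs and 9 outputs, something like:

model = agent((9,), 9)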
My dataset consists of tic-tac-toe game plays and looks like the following (each game starts with an empty board, and the cells are then filled in by successive actions):
[[['_', '_', '_', '_', '_', '_', '_', '_', '_'], ['_', '_', '_', '_', 'X', '_', '_', '_', 'O'], ['O', 'X', '_', '_', 'X', '_', '_', '_', 'O'], ['O', 'X', 'X', '_', 'X', '_', '_', 'O', 'O'], ['O', 'X', 'X', 'X', 'X', 'O', '_', 'O', 'O'], ['O', 'X', 'X', 'X', 'X', 'O', 'X', 'O', 'O']],
 [['_', '_', '_', '_', '_', '_', '_', '_', '_'], ['O', 'X', '_', '_', '_', '_', '_', '_', '_'], ['O', 'X', '_', '_', '_', '_', '_', 'X', 'O'], ['O', 'X', '_', '_', 'X', '_', '_', 'X', 'O']]]
Does anyone have an idea of how I can use this data to train the model shown above?
You have two options. The first is supervised learning: given the current state, predict the correct action. Your network would have 9 output neurons, one for each cell of the board (i.e. one for each possible action). The ground truth for each input state is the location of the move that was actually played from that state in the dataset. You will also have to train your network for both cases, when it moves first (with X) and when it moves second (with O). However, because you are doing supervised learning, when you train for X you should only select the games that end in a win for X, and likewise for O. In supervised learning you teach the network what to do by giving it labels; if a game leads to a defeat for X, you cannot train the network to reproduce those moves, as that would teach your system to lose instead of to win. A minimal sketch of this approach is shown below.
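To make this concrete, here is a minimal sketch of how the dataset could be turned into supervised training pairs. It assumes the games are stored in a list called games with the structure shown in the question, that board symbols are mapped to numbers (empty = 0, X = 1, O = -1), and that only games won by the player being trained are passed in; the names encode, build_xy and games_won_by_x are placeholders for illustration, not part of your code.

import numpy as np

def encode(board):
    # Map each cell to a number: empty -> 0, X -> 1, O -> -1 (arbitrary but consistent choice).
    mapping = {'_': 0.0, 'X': 1.0, 'O': -1.0}
    return np.array([mapping[c] for c in board], dtype=np.float32)

def build_xy(games, player='X'):
    states, actions = [], []
    for game in games:
        # Compare consecutive board states and keep only the moves made by `player`.
        for before, after in zip(game[:-1], game[1:]):
            diff = [i for i in range(9) if before[i] != after[i]]
            if len(diff) == 1 and after[diff[0]] == player:
                states.append(encode(before))
                actions.append(diff[0])
    X = np.stack(states)
    # One-hot targets: 1 for the cell that was played, 0 elsewhere.
    y = np.eye(9, dtype=np.float32)[actions]
    return X, y

# games_won_by_x would be the subset of your dataset where X won.
# X_train, y_train = build_xy(games_won_by_x, player='X')
# model = agent((9,), 9)
# model.fit(X_train, y_train, epochs=50, batch_size=32)

Note that pure supervised move prediction is really a classification problem (pick one of 9 cells), so you could also replace the linear output layer with a softmax and train with categorical cross-entropy instead of the Huber loss.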
The other option is reinforcement learning. Basically, you pit your network against itself and let it play games against itself. If the network wins a game, you assign a reward of 1 to all the moves it made in that game; if it loses, a reward of -1; if the game is a tie, a reward of 0. This encourages or discourages the network from using those moves again. This concept of pitting two versions of the same network against each other is similar to AlphaGo and its more recent variants. In this case you will not need the dataset, as the data is collected dynamically by having the network play against itself. However, it may be considerably more complicated to code. A rough sketch of how the rewards could be turned into training targets is shown below.
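As a very rough illustration of the reward idea only (not a full self-play or DQN implementation), assume each self-play episode is recorded as a list of (encoded_state, action_index) pairs for one player, and outcome is +1 for a win, -1 for a loss and 0 for a tie; targets_from_game and gamma are names introduced here for the sketch.

import numpy as np

def targets_from_game(model, episode, outcome, gamma=0.9):
    states = np.stack([s for s, _ in episode])
    # Start from the network's current predictions so only the played action is updated.
    targets = model.predict(states, verbose=0)
    reward = outcome
    # Walk backwards so earlier moves receive a discounted share of the final reward.
    for t in range(len(episode) - 1, -1, -1):
        _, action = episode[t]
        targets[t, action] = reward
        reward *= gamma
    return states, targets

# After each self-play game:
# states, targets = targets_from_game(model, episode, outcome=+1)
# model.fit(states, targets, verbose=0)

The discount factor gamma is optional; setting it to 1.0 gives every move in the game the same reward, which matches the scheme described above.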