Tags: neural-network, tic-tac-toe, backpropagation

Neural Network playing Tic Tac Toe doesn't learn


I have a neural network playing tic-tac-toe. (I know there are better methods for this, but I want to learn about NNs.) So the NN plays against a random AI. First, it should learn to make an allowed move, i.e., not to choose a field that is already occupied.

It doesn't get very far with this, however. When the NN chooses an illegal move, I optimize the weights so that the distance to another, randomly chosen (legal) field is minimized. (There is one output, which should take values between 1 and 9.)
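
Roughly, my update for an illegal move looks like this (a simplified PyTorch sketch; the layer sizes, learning rate, and board encoding are placeholders):

    import random

    import torch
    import torch.nn as nn

    # ONE output neuron that should emit a field number between 1 and 9.
    net = nn.Sequential(nn.Linear(9, 18), nn.ReLU(), nn.Linear(18, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.01)

    def train_on_illegal_move(board, legal_fields):
        # board: (1, 9) float tensor; legal_fields: list of free fields, 1-9.
        # Regress the single output towards a randomly chosen legal field.
        target = torch.tensor([[float(random.choice(legal_fields))]])
        loss = nn.functional.mse_loss(net(board), target)
        opt.zero_grad()
        loss.backward()
        opt.step()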

My problem is: by changing the weights, a formerly optimized output is now also changed. So I have this kind of overfitting: every time I backpropagate to optimize the weights for one particular situation, the decision for every other situation becomes worse!

I know I should probably have 9 output neurons instead of 1 and should probably not use a random field as the target, as I assume this can mess things up. I am starting to change this.

Still, the issue seems to remain. Obviously. How can I improve the decision in one situation without forgetting every other situation? One solution I came up with is to "remember" every game played and optimize simultaneously over all games played.

However, after a while this becomes computationally very demanding. Also, it seems to go in the direction of a complete enumeration of all possible board situations. This might be possible for Tic Tac Toe, but if I move to another game, say Go, it becomes infeasible.

Where is my mistake? How do I generally tackle this problem? Or where could I read about it? Thanks a lot!


Solution

  • To tackle this problem efficiently, you should consider Reinforcement Learning methods instead of what you are currently doing. What you are trying to do is to learn the behaviour of an agent playing Tic Tac Toe. The agent gets a high reward when it wins a game, a high penalty when it loses, and an even higher penalty when it performs an illegal move. My guess is that methods such as Q-learning with neural networks will work perfectly, even with very simple nets. One useful paper on the topic is https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf, or the earlier papers on TD-Gammon (you can easily find tutorials using the keywords TD-Gammon, Q-learning, ...).
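
    For illustration, here is a minimal tabular Q-learning sketch in Python (no neural net yet) that uses exactly this reward scheme; all constants and reward values are arbitrary choices of mine:

        import random
        from collections import defaultdict

        # Minimal tabular Q-learning for Tic Tac Toe against a random opponent.
        # Rewards follow the scheme above: +1 win, -1 loss, -2 illegal move
        # (all values, like ALPHA/GAMMA/EPSILON, are illustrative choices).
        ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
        Q = defaultdict(lambda: [0.0] * 9)  # board state -> one value per field

        LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6),
                 (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

        def winner(b):
            for i, j, k in LINES:
                if b[i] != ' ' and b[i] == b[j] == b[k]:
                    return b[i]
            return None

        def choose(state):
            # Epsilon-greedy over all 9 fields; illegal fields are penalized,
            # so the agent has to learn legality by itself.
            if random.random() < EPSILON:
                return random.randrange(9)
            return max(range(9), key=lambda a: Q[state][a])

        def play_episode():
            # The agent ('X') always moves first, so the board can only fill
            # up on the agent's own move.
            board = [' '] * 9
            while True:
                state = tuple(board)
                action = choose(state)
                if board[action] != ' ':  # illegal move: penalty, then retry
                    Q[state][action] += ALPHA * (-2.0 - Q[state][action])
                    continue
                board[action] = 'X'
                if winner(board) == 'X':  # win
                    Q[state][action] += ALPHA * (1.0 - Q[state][action])
                    return
                if ' ' not in board:      # draw
                    Q[state][action] += ALPHA * (0.0 - Q[state][action])
                    return
                empty = [i for i in range(9) if board[i] == ' ']
                board[random.choice(empty)] = 'O'  # random opponent replies
                if winner(board) == 'O':  # loss
                    Q[state][action] += ALPHA * (-1.0 - Q[state][action])
                    return
                # Standard Q-learning bootstrap from the successor state.
                target = GAMMA * max(Q[tuple(board)])
                Q[state][action] += ALPHA * (target - Q[state][action])

        for _ in range(50000):
            play_episode()

    Replacing the table Q with a small network, and training on random samples of stored transitions instead of only the latest one (the experience replay idea from the DQN paper above), is the natural next step, and it directly counters the forgetting effect you describe.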

    By the way, a more down-to-earth answer to why your model might not work is that you seem to be using a single unit to represent a categorical output: if you want to represent an integer between 1 and N, you should use N output neurons with values between 0 and 1, and pick the neuron with the highest value as your answer. A single neuron with a value between 1 and 9 creates an unnatural asymmetry between your outputs: for example, when the expected value is 3, your network gets a higher error for outputting a 9 than a 2. This should obviously not be the case: all wrong answers are equally wrong.
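
    Concretely, the categorical version could look like this (a PyTorch sketch of my suggestion; the hidden layer size and the board encoding are arbitrary assumptions):

        import torch
        import torch.nn as nn

        # 9 outputs: one score per field, instead of a single 1-9 value.
        net = nn.Sequential(nn.Linear(9, 36), nn.ReLU(), nn.Linear(36, 9))

        # Assumed board encoding: 1 = own mark, -1 = opponent, 0 = empty.
        board = torch.tensor([[1.0, -1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, -1.0]])
        scores = net(board)                 # shape (1, 9)
        move = scores.argmax(dim=1).item()  # field with the highest score

        # Training treats the move as a class label: cross-entropy penalizes
        # every wrong field equally, so answering field 9 when the target is
        # field 3 costs no more than answering field 2.
        target = torch.tensor([2])          # target field 3, as 0-based index
        loss = nn.functional.cross_entropy(scores, target)
        loss.backward()

    (Masking out the scores of occupied fields before taking the argmax is another common trick, though with the reinforcement learning setup above the agent can also learn legality on its own.)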

    Hope this helps,

    Best