I am working on an implementation of the MCTS algorithm in the context of zero-sum board games with perfect information, e.g. Chess, Go, or Checkers.
As I understand it, each iteration of the algorithm has four steps: selection, expansion, simulation, and backpropagation.
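In skeleton form, one iteration would look something like this (the helpers are just placeholders to fix terminology, not a real implementation):

```python
def mcts_iteration(root):
    leaf = select(root)             # 1. selection: walk down the tree via the tree policy
    child = expand(leaf)            # 2. expansion: add a new child node to the leaf
    reward = simulate(child)        # 3. simulation: play out to a terminal state
    backpropagate(child, reward)    # 4. backpropagation: update node statistics on the path
```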
My question is about how opponent moves should be represented in the tree, and how they should be handled at each stage.
For example, let's imagine a game of Go, where we (black) are playing against an AI (white). When black makes an action ab from the root node s0, it is then white's turn to make an action aw.
My initial thought was that each action produces a new state, so s0 -> ab -> s1 -> aw -> s2, where each state s would be a node. However, this would affect the selection process in MCTS. In this case, wouldn't MCTS have a tendency to explore bad aw moves, since those return better rewards for black?
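For concreteness, here is roughly the per-state node structure I have in mind for this first approach (a sketch; all the names are mine, just for illustration):

```python
class Node:
    """One node per game state; `player` is the player to move in `state`."""
    def __init__(self, state, player, parent=None, move=None):
        self.state = state
        self.player = player        # 'black' or 'white', whoever moves next
        self.parent = parent
        self.move = move            # the action that led from parent to here
        self.children = []
        self.visits = 0
        self.total_reward = 0.0     # sum of simulation rewards backed up here
```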
The alternative solution I thought of was to combine the two actions into a single step, so s0 -> ab -> aw -> s1. However, this would make the decision making more difficult, since each root-level action is now associated with multiple different nodes.
Is there any framework which suggests how opponents should be represented in MCTS? Any help would be appreciated.
Edit 1: Since we are playing black in the example above, the reward function at the end of each simulation is with respect to black. E.g. if black wins at the end of the game, the reward is backed up through all nodes on the path, both black and white nodes. My expectation was that the white node which allowed black to win would then have a high state value.
But maybe I should flip the reward when doing backpropagation? E.g. if black wins, it is 1 for black nodes and -1 for white nodes. This way, the selection function stays the same. Would this be correct?
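In code, what I mean is something like this (a sketch using the `Node` above; the convention that each node stores rewards from the perspective of the player who moved into it is my assumption):

```python
import math

def backpropagate(node, reward):
    # `reward` is +1 if the player who made the move leading to `node`
    # won the simulated game, -1 if they lost.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        reward = -reward            # flip perspective for the parent's player
        node = node.parent

def uct_select(node, c=1.4):
    # Because each child's reward is stored from the perspective of the
    # player to move at `node`, the same argmax works for both colors.
    return max(
        node.children,
        key=lambda ch: ch.total_reward / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits),
    )
```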
You should play either against a known strong opponent or against the algorithm itself.
Assuming you run against your own algorithm, feed the data back into it to figure out the "best" move. Make sure the algorithm works for the intended side (i.e. if you play Go/Chess, the easiest thing is to swap the colors of the game pieces).
If you play against yourself, you basically generate twice as many data points for learning the game.
If you are just starting out, it might be worth playing against some other machine player. You don't get as many data points, but the ones you do get teach you more (i.e. a bad move will be learned faster).
You probably want to start by playing against some reasonable existing AI, then switch to playing against itself.
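For example, a self-play data-generation loop might look roughly like this (a sketch; `mcts_move` and the `state` methods are placeholders, not a real API):

```python
def self_play_game(initial_state):
    """Play one game of the engine against itself, recording positions
    and moves from both sides for later learning."""
    state, player = initial_state, 'black'
    history = []
    while not state.is_terminal():              # placeholder terminal test
        move = mcts_move(state, player)         # placeholder: run MCTS from here
        history.append((state, player, move))
        state = state.apply(move)               # placeholder move application
        player = 'white' if player == 'black' else 'black'
    # One game yields training examples for both colors at once.
    return history, state.winner()              # placeholder winner query
```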