Tags: python, keras, reinforcement-learning

Understanding Monte Carlo Tree Search


I'm trying to create an implementation of AlphaZero using Keras, but I'm not too sure about the MCTS part. My understanding and implementation of Monte Carlo Tree Search is as follows:


import numpy as np
import cupy as cp  # CuPy arrays live on the GPU; cp.asnumpy copies them back to host memory


class MCTS(object):
    def __init__(self, action_size, movesets, nsims, ndepth):
        self.nsims = nsims              # number of simulated playouts per call
        self.ndepth = ndepth            # number of steps in each playout
        self.movesets = movesets        # moves sampled at each step
        self.action_size = action_size  # size of the action space

    def evaluate_and_act(self, agent, stateizer, critic, state):
        sims = []

        print("Beginning monte carlo tree search")

        true_state = state  # remember the real state so every simulation restarts from it
        for i in range(self.nsims):
            random_walk = []
            for j in range(self.ndepth):
                random_actions = []
                print("Searching depth", j, "of simulation", i)

                for k in range(self.movesets):
                    # pick a uniformly random move and weight it with the critic's value estimate
                    rand_move = np.random.choice(self.action_size)
                    rand_move_matrix = cp.add(cp.zeros((1, self.action_size)), .0001)
                    rand_move_matrix[0][rand_move] = critic.predict(state, batch_size=64)[0][0]
                    random_actions.append(cp.asnumpy(rand_move_matrix))
                random_action_concat = np.concatenate(random_actions, -1)
                # predict the simulated next state from the concatenated action vectors
                state = stateizer.predict(cp.asnumpy(random_action_concat), batch_size=64)
                random_walk.append(random_actions)
            sims.append(random_walk)
            state = true_state

        # keep the random walk whose summed critic scores are the highest
        best_reward = -1000000.0
        best_walk = sims[0]  # initialize so best_walk is always bound
        for walk in sims:
            sum_reward = np.sum(walk)
            if sum_reward >= best_reward:
                best_walk = walk
                best_reward = sum_reward

        return best_walk[0]

It seems like I don't need the policy network at all in this implementation, just the critic. Can someone help me understand whether my implementation is correct, and if not, where it deviates from AlphaZero? Thanks.


Solution

  • It seems to be missing the U(s, a) exploration term from the AlphaGo paper (a minimal sketch of how it is typically used follows below).

    U(s, a) ∝ P(s, a) / (1 + N(s, a)), where P(s, a) is the prior probability of the move from the policy network and N(s, a) is the number of times the edge (s, a) has been visited during the search.
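
As an illustrative sketch (not from the question's code; the Edge class, select_action, and c_puct are assumed names), selection in AlphaZero-style MCTS combines the value estimate Q(s, a) with that exploration bonus:

from dataclasses import dataclass
import math


@dataclass
class Edge:
    prior: float            # P(s, a) from the policy network
    visit_count: int = 0    # N(s, a), how often this edge has been traversed
    value_sum: float = 0.0  # sum of backed-up values, so Q(s, a) = value_sum / visit_count


def select_action(children, c_puct=1.0):
    """Pick the action maximizing Q(s, a) + U(s, a) over a dict {action: Edge}."""
    total_visits = sum(edge.visit_count for edge in children.values())

    def score(edge):
        q = edge.value_sum / edge.visit_count if edge.visit_count else 0.0
        # U(s, a) ∝ P(s, a) / (1 + N(s, a)); the sqrt term scales exploration
        # by how much the parent state has been visited overall.
        u = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visit_count)
        return q + u

    return max(children, key=lambda a: score(children[a]))

Because each step of the search picks the move that maximizes Q + U rather than a uniformly random one, the policy network is not optional: its priors steer the search toward promising moves, while the 1 + N(s, a) denominator shrinks the bonus for edges that have already been explored. The random-walk version in the question uses neither quantity, which is the main way it differs from AlphaZero's MCTS.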