So, I'm trying to create an implementation of AlphaZero using Keras. However, I'm not too sure about MCTS. My understanding and implementation of Monte Carlo Tree Search is as follows:
import numpy as np
import cupy as cp  # the move matrices below are built on the GPU with CuPy


class MCTS(object):
    def __init__(self, action_size, movesets, nsims, ndepth):
        self.nsims = nsims          # number of simulations (random walks)
        self.ndepth = ndepth        # depth of each walk
        self.movesets = movesets    # moves sampled per depth step
        self.action_size = action_size

    def evaluate_and_act(self, agent, stateizer, critic, state):
        sims = []
        print("Beginning Monte Carlo tree search")
        true_state = state
        for i in range(self.nsims):
            random_walk = []
            for j in range(self.ndepth):
                random_actions = []
                print("Searching depth", j, "of simulation", i)
                for k in range(self.movesets):
                    # Pick a uniformly random move and score it with the
                    # critic's value estimate of the current state.
                    rand_move = np.random.choice(self.action_size)
                    rand_move_matrix = cp.zeros((1, self.action_size)) + 0.0001
                    rand_move_matrix[0][rand_move] = critic.predict(state, batch_size=64)[0][0]
                    random_actions.append(cp.asnumpy(rand_move_matrix))
                # Advance the state with the "stateizer" model given the sampled moves.
                random_action_concat = np.concatenate(random_actions, -1)
                state = stateizer.predict(random_action_concat, batch_size=64)
                random_walk.append(random_actions)
            sims.append(random_walk)
            state = true_state  # reset to the real state before the next simulation
        # Return the first move set of the walk with the highest summed score.
        best_walk = sims[0]
        best_reward = -1000000.0
        for walk in sims:
            sum_reward = np.sum(walk)
            if sum_reward >= best_reward:
                best_walk = walk
                best_reward = sum_reward
        return best_walk[0]
It seems like I don't need the policy network at all in this implementation, just the critic. Can someone please help me understand whether my implementation is correct, and if not, why it is incorrect in terms of AlphaZero? Thanks.
It seems to be missing the U(s, a) exploration term from the AlphaGo paper:

U(s, a) ∝ P(s, a) / (1 + N(s, a))

where P(s, a) is the prior probability of the move given by the policy network and N(s, a) is the number of times the edge (s, a) has been visited during the search. This term is what brings the policy network into the tree search, so an MCTS that only queries the critic is missing a core part of AlphaZero.
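For reference, here is a minimal sketch of how the selection step usually combines the value estimate with that term. It is not tied to your networks; the c_puct constant and the toy priors, visit counts, and Q values are assumptions purely for illustration:

import numpy as np

def puct_scores(prior, visit_count, q_value, c_puct=1.0):
    # PUCT rule from AlphaGo Zero / AlphaZero:
    # score(s, a) = Q(s, a) + c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))
    total_visits = np.sum(visit_count)
    u = c_puct * prior * np.sqrt(total_visits + 1e-8) / (1.0 + visit_count)
    return q_value + u

# Toy usage: 4 actions, priors from a (hypothetical) policy head,
# running mean values Q and visit counts N maintained by the tree.
prior = np.array([0.5, 0.2, 0.2, 0.1])   # P(s, a) from the policy network
visits = np.array([10, 2, 0, 0])          # N(s, a)
q = np.array([0.1, 0.3, 0.0, 0.0])        # Q(s, a) = W(s, a) / N(s, a)

best_action = int(np.argmax(puct_scores(prior, visits, q, c_puct=1.5)))

Because U(s, a) shrinks as N(s, a) grows, the search initially follows the policy prior and gradually shifts its trust to the accumulated value estimates, rather than expanding uniformly random moves as in your current loop.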