I just implemented Q-Learning without neural networks, but I am stuck on implementing it with neural networks.
Here is pseudocode showing how my Q-Learning is implemented:
train(int iterations):
    buffer = empty buffer
    for i = 0 while i < iterations:
        move = null
        if random(0,1) > threshold:
            move = random_move()
        else:
            move = network_calculate_move()
        input_to_network = game.getInput()
        output_of_network = network.calculate(input_to_network)
        game.makeMove(move)
        reward = game.getReward()
        maximum_next_q_value = max(network.calculate(game.getInput()))
        if reward is 1 or -1: //either lost or won
            output_of_network[move] = reward
        else:
            output_of_network[move] = reward + discount_factor * maximum_next_q_value
        buffer.add(input_to_network, output_of_network)
        if buffer is full:
            buffer.remove_oldest()
            train_network(buffer)

train_network(buffer b):
    batch = b.extract_random_batch(batch_size)
    for each input,output in batch:
        network.train(input, output, learning_rate) //one forward/backward pass
My problem right now is that this code works for a buffer size of less than 200. For any buffer over 200, my code does not work anymore, so I've got a few questions:
Is this implementation correct? (In theory)
Yes, your pseudocode does have the right approach.
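The target that the pseudocode builds for the chosen move can be written as a small function. This is only a minimal sketch: the name td_target and its parameters are hypothetical, and the terminal flag stands in for the "reward is 1 or -1" check in the question.

```python
def td_target(q_current, move, reward, q_next, discount_factor, terminal):
    """Return a copy of q_current with the entry for `move` replaced
    by the Q-learning target; all other entries are left unchanged."""
    target = list(q_current)
    if terminal:
        # The game ended (won or lost): there is no next state to bootstrap from.
        target[move] = reward
    else:
        # Bootstrap from the best Q-value predicted for the next state.
        target[move] = reward + discount_factor * max(q_next)
    return target


# Illustrative values only: three possible moves, move 1 was played.
example = td_target(
    q_current=[0.2, 0.5, 0.1],
    move=1,
    reward=0.0,
    q_next=[0.3, 0.8, 0.4],
    discount_factor=0.9,
    terminal=False,
)
```

Only the entry for the move that was actually played changes, so the network receives no error signal for the moves it did not take.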
How big should the batch size be compared to the buffer size?
Algorithmically speaking, using larger batches in stochastic gradient descent allows you to reduce the variance of your stochastic gradient updates (by taking the average of the gradients in the batch), and this in turn allows you to take bigger step-sizes, which means the optimization algorithm will make progress faster.
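The variance claim can be checked with a toy experiment. The sketch below assumes per-sample "gradients" are independent draws from a normal distribution (mean 1.0, standard deviation 2.0, both arbitrary illustrative values) and measures how much the batch average fluctuates for different batch sizes. The function name is hypothetical.

```python
import random
import statistics


def minibatch_gradient_variance(batch_size, trials=2000, seed=0):
    """Estimate the variance of the averaged gradient over many random batches."""
    rng = random.Random(seed)
    batch_means = []
    for _ in range(trials):
        # Pretend each sample contributes a noisy scalar gradient.
        grads = [rng.gauss(1.0, 2.0) for _ in range(batch_size)]
        batch_means.append(sum(grads) / batch_size)
    return statistics.pvariance(batch_means)


var_1 = minibatch_gradient_variance(1)
var_32 = minibatch_gradient_variance(32)
print(var_1, var_32)
```

Under this independence assumption the variance shrinks roughly in proportion to 1 / batch_size. Samples drawn from a replay buffer are not fully independent, so the real reduction is typically smaller.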
The experience replay buffer stores a fixed number of recent memories; as new ones come in, the oldest ones are removed. When the time comes to train, we draw a batch of memories uniformly at random from the buffer and train the network on them.
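A buffer with this behaviour can be sketched in a few lines using Python's standard library. The class and method names here are hypothetical and only loosely mirror buffer.add and extract_random_batch from the question.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of recent memories with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry automatically when full.
        self.memories = deque(maxlen=capacity)

    def add(self, memory):
        self.memories.append(memory)

    def sample(self, batch_size):
        # Sample without replacement; never ask for more than is stored.
        k = min(batch_size, len(self.memories))
        return random.sample(list(self.memories), k)

    def __len__(self):
        return len(self.memories)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step)          # memories 0 and 1 are evicted
batch = buffer.sample(2)      # two distinct memories out of 2, 3, 4
```

What a memory contains is left open here. The question's pseudocode stores an (input, target output) pair, while many implementations store the raw transition and compute the target when the batch is drawn.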
Although the two are related, there is no standard ratio of batch size to buffer size. Experimenting with these hyperparameters is one of the joys of deep reinforcement learning.
How would one usually train the network? For how long? Until a specific MSE of the whole batch is reached?
Networks are usually trained until they "converge," which means that the Q-value estimates (the network's outputs, which take the place of the Q-table) repeatedly show no meaningful change between episodes.
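One possible way to test for this, sketched under assumptions: record the network's Q-values for a fixed set of probe states after each episode, and stop once the largest change stays below a tolerance for several episodes in a row. The function name, tolerance, and patience values are hypothetical choices, not standard settings.

```python
def has_converged(q_history, tolerance=1e-3, patience=3):
    """q_history is a list of snapshots, one per episode. Each snapshot is a
    list of Q-value lists, one per probe state. Returns True when the last
    `patience` episode-to-episode changes are all within `tolerance`."""
    if len(q_history) < patience + 1:
        return False  # not enough episodes recorded yet
    recent = q_history[-(patience + 1):]
    for before, after in zip(recent, recent[1:]):
        largest_change = max(
            abs(a - b)
            for row_before, row_after in zip(before, after)
            for b, a in zip(row_before, row_after)
        )
        if largest_change > tolerance:
            return False
    return True


# One probe state with two moves; the estimates stop moving after episode 1.
history = [[[0.0, 0.0]], [[0.5, 0.0]], [[0.5, 0.0]], [[0.5, 0.0]]]
print(has_converged(history))                  # the jump from 0.0 to 0.5 is still in the window
print(has_converged(history + [[[0.5, 0.0]]]))  # the last three changes are all zero
```

In practice the estimates keep fluctuating slightly because of exploration and minibatch noise, so the tolerance usually has to be tuned per problem.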