Given the following environment, why does the REINFORCE algorithm (or any other simple policy-gradient algorithm) converge to the optimal solution of taking action b, even if the starting probability of taking action a is much higher?
Start in state S0
Take action a ---> reward of 5
Take action b ---> reward of 10
Episode ends, start again in state S0
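In code form, this is just a one-step, two-armed bandit. Here is a minimal sketch of the environment in Python (the class and method names are illustrative, not taken from any RL library):

```python
# Minimal sketch of the one-step bandit environment described above.
class TwoActionBandit:
    def reset(self):
        return "S0"  # every episode starts in state S0

    def step(self, action):
        # Action "a" pays 5, action "b" pays 10; the episode ends immediately.
        reward = 5 if action == "a" else 10
        done = True
        return reward, done
```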
The key to this question is that the REINFORCE objective (the negative of the usual loss):

log(probability(action)) * reward

has, with respect to the probability of the action taken, the gradient

(1/probability) * reward
So if the model takes action a with probability 90% (0.9), the gradient of the objective is (1/0.9) * reward ≈ 1.11 * reward. If it takes action b with probability 10% (0.1), the gradient is (1/0.1) * reward = 10 * reward, i.e. nine times larger. This exactly compensates for the fact that the high-probability action is also sampled nine times more often: averaged over many episodes, the 1/probability factor cancels the sampling frequency, so each action's contribution to the expected update is proportional only to its reward. Because the two probabilities must sum to one, pushing up the probability of b (reward 10) more strongly than the probability of a (reward 5) shifts probability mass toward b, and the policy converges to always taking the action with the highest reward.
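To check this numerically, here is a minimal REINFORCE sketch for the bandit above, using plain NumPy and a two-logit softmax policy (the learning rate, episode count, and variable names are arbitrary choices of mine, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([5.0, 10.0])       # index 0 = action a, index 1 = action b
theta = np.array([np.log(9.0), 0.0])  # softmax logits, so p(a) = 0.9 initially
lr = 0.01                             # learning rate (arbitrary choice)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for episode in range(5000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)    # sample an action from the current policy
    reward = rewards[action]
    # Gradient of log pi(action) with respect to the logits: one_hot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta = theta + lr * reward * grad_log_pi   # REINFORCE: ascend on log pi * reward

print(softmax(theta))  # p(b) ends up close to 1 despite the 0.9 head start for a
```

Note that every single update increases the probability of whichever action was just taken (both rewards are positive), yet the expected update still favors b, so over many episodes the policy drifts toward p(b) ≈ 1.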