Given the following environment, why does the REINFORCE algorithm (or any other simple policy-gradient algorithm) converge to the optimal solution of taking action b, even if the starting probability of taking action a is much higher?
Start in state S0
Take action a ---> reward of 5
Take action b ---> reward of 10
Episode ends, start again in state S0
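In code form, this is just a one-step, two-armed bandit. Here is a minimal sketch of the environment in Python (the class and method names are illustrative, not taken from any RL library):

```python
# Minimal sketch of the one-step bandit environment described above.
class TwoActionBandit:
    def reset(self):
        return "S0"  # every episode starts in state S0

    def step(self, action):
        # Action "a" pays 5, action "b" pays 10; the episode ends immediately.
        reward = 5 if action == "a" else 10
        done = True
        return reward, done
```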
The key to this question is that the REINFORCE objective (the negative of the usual loss):

log(probability(action)) * reward

has, with respect to the probability of the action taken, the gradient

(1/probability) * reward
So if the model takes action a with probability 90% (0.9), the gradient of the objective is (1/0.9) * reward ≈ 1.11 * reward. If it takes action b with probability 10% (0.1), the gradient is (1/0.1) * reward = 10 * reward, i.e. nine times larger. This exactly compensates for the fact that the high-probability action is also sampled nine times more often: averaged over many episodes, the 1/probability factor cancels the sampling frequency, so each action's contribution to the expected update is proportional only to its reward. Because the two probabilities must sum to one, pushing up the probability of b (reward 10) more strongly than the probability of a (reward 5) shifts probability mass toward b, and the policy converges to always taking the action with the highest reward.
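To check this numerically, here is a minimal REINFORCE sketch for the bandit above, using plain NumPy and a two-logit softmax policy (the learning rate, episode count, and variable names are arbitrary choices of mine, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([5.0, 10.0])       # index 0 = action a, index 1 = action b
theta = np.array([np.log(9.0), 0.0])  # softmax logits, so p(a) = 0.9 initially
lr = 0.01                             # learning rate (arbitrary choice)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for episode in range(5000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)    # sample an action from the current policy
    reward = rewards[action]
    # Gradient of log pi(action) with respect to the logits: one_hot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta = theta + lr * reward * grad_log_pi   # REINFORCE: ascend on log pi * reward

print(softmax(theta))  # p(b) ends up close to 1 despite the 0.9 head start for a
```

Note that every single update increases the probability of whichever action was just taken (both rewards are positive), yet the expected update still favors b, so over many episodes the policy drifts toward p(b) ≈ 1.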