Tags: machine-learning, state, reinforcement-learning, montecarlo

Why does the REINFORCE algorithm converge when initialized with uneven probabilities?


Given the following environment, why does the REINFORCE algorithm (or any other simple policy gradient algorithm) converge to the optimal solution of taking action b, even if the starting probability for taking action a is much higher? (A minimal code sketch of the environment follows the list below.)

  1. Start in state S0

  2. Take action a ---> reward of 5

    Take action b ---> reward of 10

  3. Episode ends, start again in state S0
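
A minimal sketch of this environment in Python (the class name and the Gym-style reset/step interface are my own choices for illustration, not part of the question):

    class OneStepEnv:
        """Single-state, single-step environment: action "a" gives reward 5, action "b" gives reward 10."""
        REWARDS = {"a": 5.0, "b": 10.0}

        def reset(self):
            return "S0"   # the only state

        def step(self, action):
            reward = self.REWARDS[action]
            done = True   # the episode ends after one action
            return "S0", reward, done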


Solution

  • It will converge to the optimal solution of taking action b, because in the long run the gradient steps for the action with the higher reward are larger.

    The key to this question is that the per-episode objective (the negative of the loss)

    log(probability(action)) * reward

    has, with respect to the probability of the chosen action, the gradient

    (1/probability) * reward

    So if the model takes action a with probability 90% (0.9), the gradient for that episode is 1/0.9 * reward = 1.111 * reward. If it takes action b with probability 10% (0.1), the gradient is 1/0.1 * reward = 10 * reward, i.e. nine times larger each time b is actually sampled. This 1/probability factor exactly compensates for the fact that high-probability actions are sampled more often: in expectation, the gradient with respect to each action's probability is simply that action's reward (probability of being sampled × 1/probability × reward). Since action b's reward (10) is larger than action a's (5), b's probability is pushed up faster than a's, and the model converges to taking only the action with the highest reward.
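
    A minimal REINFORCE sketch on this environment illustrates the convergence. The softmax parameterization, starting logits, learning rate, and episode count are arbitrary choices for illustration; the answer reasons about the gradient with respect to the probability itself, while this sketch takes the usual practical route of gradients with respect to softmax logits.

      import math, random

      # Rewards mirror the environment above: action "a" -> 5, action "b" -> 10.
      rewards = {"a": 5.0, "b": 10.0}
      logits = {"a": 2.2, "b": 0.0}   # softmax(2.2, 0.0) starts at roughly 90% / 10%
      lr = 0.01

      def softmax(lg):
          m = max(lg.values())
          exps = {k: math.exp(v - m) for k, v in lg.items()}
          z = sum(exps.values())
          return {k: v / z for k, v in exps.items()}

      for episode in range(5000):
          probs = softmax(logits)
          action = random.choices(list(probs), weights=list(probs.values()))[0]
          reward = rewards[action]
          # For a softmax policy, d log pi(chosen) / d logit_a is
          # 1 - pi(a) if a is the chosen action, and -pi(a) otherwise.
          # Scaling by the reward gives the REINFORCE update; note that the
          # rarely chosen action gets the larger per-sample step, as argued above.
          for a in logits:
              grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
              logits[a] += lr * reward * grad_log_pi

      print(softmax(logits))

    Starting from roughly 90% probability on action a, the printed probabilities should end up concentrated almost entirely on action b.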