
Are these two different formulas for Value-Iteration update equivalent?


While studying MDPs via different sources, I came across two different formulas for the value update in the Value-Iteration algorithm.

The first one is (the one on Wikipedia and a couple of books):

$$V_{i+1}(s) = \max_a \sum_{s'} P_a(s, s') \left[ R_a(s, s') + \gamma V_i(s') \right]$$

And the second one is (in some questions here on Stack Overflow, and my course slides):

$$V_{i+1}(s) = R(s) + \gamma \max_a \sum_{s'} P_a(s, s') V_i(s')$$

For a specific iteration, they don't seem to give the same answer. Does one of them converge faster to the solution?


Solution

  • Actually the difference is in the reward function: R_a(s, s') in the first formula versus R(s) in the second.

    The first equation is the more general form.

    In the first one, the reward R_a(s, s') is received when transitioning from state s to state s' due to action a, so the reward can differ across states and actions.

    But if every state s has some pre-defined reward (regardless of the previous state and the action that leads to s), then the formula simplifies to the second one, as the derivation below shows.

    The final values are not necessarily equal, but the policies are the same (see the sketch after this answer).
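To see why the simplification works: assuming R_a(s, s') = R(s) for every action a and successor state s', the reward term can be pulled out of both the sum and the max, because the transition probabilities sum to one:

$$V_{i+1}(s) = \max_a \sum_{s'} P_a(s, s') \left[ R(s) + \gamma V_i(s') \right] = R(s) + \gamma \max_a \sum_{s'} P_a(s, s') V_i(s'), \quad \text{since } \sum_{s'} P_a(s, s') = 1.$$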
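And here is a minimal numerical sketch, using a hypothetical two-state, two-action MDP whose transition probabilities and rewards are made up for illustration. It runs both updates side by side; when the transition reward depends only on the current state, the two rules produce matching values and the same greedy policy:

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are made up for illustration.
# P[a, s, s2]: probability of landing in state s2 after taking action a in state s.
P = np.array([
    [[0.8, 0.2],    # action 0: rows are the current state s,
     [0.3, 0.7]],   #           columns are the next state s'
    [[0.1, 0.9],    # action 1
     [0.6, 0.4]],
])
r = np.array([1.0, -0.5])                        # state-only reward R(s)
R = np.broadcast_to(r[None, :, None], P.shape)   # R_a(s, s') = R(s) for all a, s'
gamma = 0.9

def update_general(V):
    # V_{i+1}(s) = max_a sum_{s'} P_a(s, s') [R_a(s, s') + gamma V_i(s')]
    Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
    return Q.max(axis=0), Q.argmax(axis=0)

def update_simplified(V):
    # V_{i+1}(s) = R(s) + gamma max_a sum_{s'} P_a(s, s') V_i(s')
    Q = np.einsum('ast,t->as', P, V)
    return r + gamma * Q.max(axis=0), Q.argmax(axis=0)

V1, V2 = np.zeros(2), np.zeros(2)
for _ in range(100):
    V1, pi1 = update_general(V1)
    V2, pi2 = update_simplified(V2)

print(np.allclose(V1, V2))   # True: the values coincide here because
print((pi1 == pi2).all())    # R_a(s, s') depends only on s, and so do the greedy policies
```

If you instead make R depend on the transition (different rewards for different s'), the two value estimates diverge, but the greedy policies extracted from them still agree, which is the point of the answer above.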