While studying MDPs via different sources, I came across two different formulas for the value update in the value iteration algorithm.
The first one (the one on Wikipedia and in a couple of books) is:

$$V_{i+1}(s) = \max_a \sum_{s'} P_a(s, s') \left( R_a(s, s') + \gamma V_i(s') \right)$$
And the second one (in some questions here on Stack Exchange, and in my course slides) is:

$$V_{i+1}(s) = R(s) + \gamma \max_a \sum_{s'} P_a(s, s') V_i(s')$$

For a specific iteration, they don't seem to give the same answer. Does one of them converge faster to the solution?
Actually, the difference is in the reward function: $R_a(s, s')$ in the first formula versus $R(s)$ in the second. The first equation is the more general one.
In the first one, the reward is $R_a(s, s')$ when transitioning from state $s$ to state $s'$ due to action $a$. The reward can be different for different states and actions.
But if for every state $s$ we have some pre-defined reward $R(s)$ (regardless of the previous state and the action that leads to $s$), then we can simplify the formula to the second one.
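To make the simplification explicit, here is a short derivation, assuming the reward really depends only on the current state, i.e. $R_a(s, s') = R(s)$ for all $a$ and $s'$:

$$V_{i+1}(s) = \max_a \sum_{s'} P_a(s, s') \left( R(s) + \gamma V_i(s') \right) = R(s) + \gamma \max_a \sum_{s'} P_a(s, s') V_i(s'),$$

since $\sum_{s'} P_a(s, s') = 1$ and $R(s)$ does not depend on $a$ or $s'$.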
The final values are not necessarily equal, but the policies are the same.
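To see the "different values, same policy" point concretely, here is a minimal sketch on a made-up toy MDP (3 states, 2 actions; the transition matrix, rewards, and function names are purely illustrative). The first update uses the general form, with the transition reward taken to be the reward of the landing state $s'$ (an assumption, one common convention); the second uses the $R(s)$-outside-the-max form. Under that assumption the two fixed points differ, but the greedy policies agree:

```python
import numpy as np

# A tiny hypothetical MDP used only for illustration: 3 states, 2 actions.
# P[a][s][s'] is the transition probability, R[s] is a state-only reward.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.0, 0.9, 0.1],   # action 1
     [0.0, 0.1, 0.9],
     [0.1, 0.0, 0.9]],
])
R = np.array([0.0, 1.0, 10.0])

def value_iteration_general(P, R, gamma, iters=500):
    # First form: V(s) = max_a sum_s' P_a(s,s') * (R_a(s,s') + gamma * V(s'))
    # Here the transition reward R_a(s,s') is taken to be the reward of the
    # landing state s' (an assumed convention for this toy example).
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.einsum('ast,t->as', P, R + gamma * V)   # Q[a, s]
        V = Q.max(axis=0)
    return V, Q.argmax(axis=0)

def value_iteration_state_reward(P, R, gamma, iters=500):
    # Second form: V(s) = R(s) + gamma * max_a sum_s' P_a(s,s') * V(s')
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.einsum('ast,t->as', P, V)               # Q[a, s]
        V = R + gamma * Q.max(axis=0)
    return V, Q.argmax(axis=0)

V1, pi1 = value_iteration_general(P, R, gamma)
V2, pi2 = value_iteration_state_reward(P, R, gamma)
print("values (general form):", np.round(V1, 3))
print("values (R(s) form):   ", np.round(V2, 3))
print("greedy policies match:", np.array_equal(pi1, pi2))   # True, values differ
```

At the fixed points the two value functions are related by $V_2(s) = R(s) + \gamma V_1(s)$, so ranking actions by $\sum_{s'} P_a(s, s') V_2(s')$ gives the same argmax as the general form, which is why the values differ while the greedy policies coincide.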