Tags: python, artificial-intelligence, reinforcement-learning

Base cases for value iteration in reinforcement learning


I am trying to implement value iteration for the '3x4 windy gridworld' MDP and am having trouble understanding the Bellman equation and its implementation.

The form of the Bellman equation that I am working with is this:

U(s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · U(s')

Suppose this is the gridworld I am working with, and I want to find the value U(s) of the tile marked X.

[Image: the 3x4 gridworld with its terminal tiles; the tile marked X sits directly to the left of the +1 terminal tile]

(Image snapshot from this video)

The reward at all tiles except the terminal tiles is defined to be zero, and it is also assumed that when you try to move in a particular direction, there is a small probability that the actual move happens at right angles to the intended one. (If you try to move down from X, you will go down with probability 0.8 but will move either left or right with probability 0.1 each.)

Now, when you try to unravel the Bellman equation for the position X, there are three neighbors (U(s')) for the action 'UP': the original location itself (since it can't move up) with probability 0.8, the +1 state to its right with probability 0.1, and the tile to its left, also with probability 0.1. These form the s' states.
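Written out with the probabilities above, the expected successor value for the action 'UP' at X is:

Σ_{s'} P(s' | X, UP) · U(s') = 0.8 · U(X) + 0.1 · U(+1 tile) + 0.1 · U(left tile)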

So a function to find the value of the state X would recursively call itself on all the successor states s'. The +1 state among these is not a problem, since it is a terminal tile and constitutes the base case. But one of those states is the original state X itself, and I don't understand how that case will ever terminate in the recursive call. Same problem with the third tile: will it ever terminate after all the calls to its neighbors, and so on?


Solution

  • Value iteration doesn't terminate on its own; it converges asymptotically to the correct values, as long as γ < 1 and the rewards are finite.

    In practice, you can terminate whenever the discount term (which picks up another factor of γ at each level of recursion) becomes so small that computing further U(s') terms would have no noticeable impact on the value you have already accumulated; see the sketch below.
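To make that concrete, here is a minimal iterative sketch of value iteration for a 3x4 gridworld of this kind. The specific layout (which tiles are terminal, where the wall is), the discount GAMMA, and the stopping threshold THETA are assumptions for illustration, not taken from the question.

```python
# Minimal value-iteration sketch for an assumed 3x4 gridworld.
# Terminal tiles carry rewards +1 and -1, every other tile has reward 0,
# and an intended move succeeds with probability 0.8, slipping to each
# side with probability 0.1 (as described in the question).

GAMMA = 0.9        # discount factor; must be < 1 for convergence
THETA = 1e-6       # stop once no value changes by more than this

ROWS, COLS = 3, 4
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}   # assumed terminal tiles
WALLS = {(1, 1)}                          # assumed blocked tile

ACTIONS = {'UP': (-1, 0), 'DOWN': (1, 0), 'LEFT': (0, -1), 'RIGHT': (0, 1)}
# For each action, the two "slip" directions at right angles to it.
SLIPS = {'UP': ('LEFT', 'RIGHT'), 'DOWN': ('LEFT', 'RIGHT'),
         'LEFT': ('UP', 'DOWN'), 'RIGHT': ('UP', 'DOWN')}

def step(state, action):
    """Deterministic move; bumping into a wall or the edge leaves you in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS and (nr, nc) not in WALLS:
        return (nr, nc)
    return state

def transitions(state, action):
    """Yield (probability, next_state) pairs for the 0.8 / 0.1 / 0.1 model."""
    yield 0.8, step(state, action)
    for slip in SLIPS[action]:
        yield 0.1, step(state, slip)

states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in WALLS]
U = {s: 0.0 for s in states}

while True:
    delta = 0.0
    for s in states:
        if s in TERMINALS:
            u_new = TERMINALS[s]          # a terminal's value is just its reward
        else:
            # Bellman backup: reward 0 plus discounted best expected successor value.
            u_new = GAMMA * max(
                sum(p * U[s2] for p, s2 in transitions(s, a)) for a in ACTIONS)
        delta = max(delta, abs(u_new - U[s]))
        U[s] = u_new
    if delta < THETA:                      # values have effectively stopped changing
        break

for r in range(ROWS):
    print(' '.join(' wall ' if (r, c) in WALLS else f'{U[(r, c)]:+.3f}'
                   for c in range(COLS)))
```

Because the algorithm sweeps over a table of values rather than recursing, the "successor is the state itself" case from the question is harmless: each sweep simply reads the value computed for X in the previous sweep, and the γ factor shrinks the contribution of ever-older sweeps until the values stop changing.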