machine-learning, reinforcement-learning

Policy Iteration vs Value Iteration


In reinforcement learning, I'm trying to understand the difference between policy iteration and value iteration. There are some general answers to this out there, but I have two specific queries to which I cannot find an answer.

1) I have heard that policy iteration "works forwards", whereas value iteration "works backwards". What does this mean? I thought that both methods just take each state, then look at all the other states it can reach, and compute the value from this -- either by marginalising over the policy's action distribution (policy iteration) or by taking the max over the action values (value iteration); I sketch both backups in code below. So why is there any notion of the "direction" in which each method "moves"?

2) Policy iteration requires an iterative process during policy evaluation to find the value function of the current policy. However, value iteration requires just one such step. Why is this different? Why does value iteration need only a single evaluation step?
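To make my mental model concrete, here is roughly what I mean by the two backups, as a small Python sketch (the names `P`, `pi`, `V` and `gamma`, and the toy MDP in the usage example, are placeholders I made up, not from any particular library):

```python
# My understanding of the two backups (P, pi, V, gamma are placeholder names).
# P[s][a] = [(prob, next_state, reward), ...],  pi[s][a] = probability of a in s.

def evaluation_backup(V, s, P, pi, gamma):
    # Policy evaluation: expectation over the policy's action distribution.
    return sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
               for a in P[s])

def value_iteration_backup(V, s, P, gamma):
    # Value iteration: max over the action values instead of an expectation.
    return max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
               for a in P[s])

# Tiny usage with one self-looping state and made-up numbers:
V = {0: 0.0}
P = {0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 0, 2.0)]}}
pi = {0: {0: 0.5, 1: 0.5}}
print(evaluation_backup(V, 0, P, pi, 0.9))   # 0.5*1.0 + 0.5*2.0 = 1.5
print(value_iteration_backup(V, 0, P, 0.9))  # max(1.0, 2.0) = 2.0
```

My second question is essentially: why must `evaluation_backup` be swept to convergence before the policy is improved, while `value_iteration_backup` is applied once per state and the algorithm moves straight on?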

Thank you!


Solution

  • The answer provided by @Nick Walker is right and quite complete; however, I would like to add a graphical explanation of the difference between Value Iteration and Policy Iteration, which may help to answer the second part of your question.

    Both methods, PI and VI, follow the same working principle based on Generalized Policy Iteration. This basically means that they alternate between improving the policy (which requires knowing its value function) and computing the value function of the new, improved policy.

    [figure: the Generalized Policy Iteration loop, alternating policy evaluation and policy improvement]

    At the end of this iterative process, both the value function and the policy converge to their optimal values.

    However, it turns out that it is not necessary to compute the full value function exactly; a single evaluation step per improvement is enough to still converge. In the following figure, (b) represents the operations performed by Policy Iteration, where the full value function is computed, while (d) shows how Value Iteration works; the code sketch at the end of this answer makes the same contrast explicit.

    [figure: (b) full value-function computation in Policy Iteration vs. (d) the single-step update of Value Iteration]

    Obviously this representation of both methods is simplistic, but it highlights the difference between the key ideas behind the two algorithms.
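
    To make the two diagrams concrete, here is a minimal sketch of both loops on a tiny, made-up two-state MDP (the `P[s][a] = [(prob, next_state, reward), ...]` layout, the numbers, and the function names are illustrative assumptions, not a reference implementation):

    ```python
    # Tiny, made-up 2-state / 2-action MDP just to have something to run.
    # P[s][a] = [(prob, next_state, reward), ...]
    P = {
        0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 1.0), (0.3, 0, 0.0)]},
        1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
    }
    STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9

    def q(V, s, a):
        """One-step look-ahead: expected return of taking action a in state s."""
        return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

    def policy_iteration(theta=1e-8):
        V = {s: 0.0 for s in STATES}
        policy = {s: 0 for s in STATES}          # start with an arbitrary policy
        while True:
            # 1) Policy evaluation: sweep the *expectation* backup repeatedly
            #    until V converges for the CURRENT policy (inner iterative loop).
            while True:
                delta = 0.0
                for s in STATES:
                    v_new = q(V, s, policy[s])   # deterministic policy here
                    delta = max(delta, abs(v_new - V[s]))
                    V[s] = v_new
                if delta < theta:
                    break
            # 2) Policy improvement: act greedily w.r.t. the evaluated V.
            stable = True
            for s in STATES:
                best_a = max(ACTIONS, key=lambda a: q(V, s, a))
                if best_a != policy[s]:
                    stable = False
                    policy[s] = best_a
            if stable:
                return V, policy

    def value_iteration(theta=1e-8):
        V = {s: 0.0 for s in STATES}
        while True:
            # Evaluation and improvement are folded into ONE max-backup per
            # state, i.e. a single evaluation step before acting greedily again.
            delta = 0.0
            for s in STATES:
                v_new = max(q(V, s, a) for a in ACTIONS)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                policy = {s: max(ACTIONS, key=lambda a: q(V, s, a)) for s in STATES}
                return V, policy

    print(policy_iteration())   # both converge to the same optimal V and policy
    print(value_iteration())
    ```

    Note how Policy Iteration has two nested loops (evaluate to convergence, then improve), whereas Value Iteration collapses them into a single sweep with a max backup, which is exactly the shortcut the second figure illustrates.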