
When to use Policy Iteration instead of Value Iteration


I'm currently studying dynamic programming solutions to Markov Decision Processes. I feel like I've got a decent grip on VI and PI and the motivation for PI is pretty clear to me (converging on the correct state utilities seems like unnecessary work when all we need is the correct policy). However, none of my experiments show PI in a favourable light in terms of runtime. It seems to consistently take longer regardless of the size of the state space and discount factor.

This could be due to the implementation (I'm using the BURLAP library), or to poor experimentation on my part. However, even the trends don't seem to show a benefit. It should be noted that the BURLAP implementation of PI is actually "modified policy iteration", which runs a limited VI variant at each iteration. My question is: do you know of any situations, theoretical or practical, in which (modified) PI should outperform VI?
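For concreteness, here is a minimal sketch of what the "modified" part means, in plain Python rather than BURLAP, assuming a hypothetical tabular MDP where `P[s][a]` is a list of `(prob, next_state, reward)` tuples: instead of solving the linear system for the current policy's values exactly, the evaluation step runs only a bounded number of Bellman backups under that fixed policy.

```python
import numpy as np

def partial_policy_evaluation(P, policy, V, gamma, sweeps=5):
    """Evaluation step of modified policy iteration: run a fixed number of
    Bellman backups under the current (fixed) policy instead of solving
    V = R_pi + gamma * P_pi * V exactly.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
    """
    n_states = len(P)
    for _ in range(sweeps):
        V = np.array([
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            for s in range(n_states)
        ])
    return V
```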


Solution

  • It turns out that Policy Iteration, and specifically Modified Policy Iteration, can outperform Value Iteration when the discount factor (gamma) is very high. The reason is that VI's error contracts by a factor of gamma per sweep, so the number of sweeps it needs grows sharply as gamma approaches 1, whereas PI typically converges after a small number of policy improvements (see the sketch below the reference).

    Kaelbling, Littman & Moore, "Reinforcement Learning: A Survey", JAIR 4 (1996):
    http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a.pdf
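To see the effect, here is a hedged, self-contained comparison on a small hypothetical chain MDP, again in plain Python rather than BURLAP. It counts full sweeps of the state space performed by value iteration and by modified policy iteration before each stops, for increasing gamma. The MDP, tolerance, and sweep budget are illustrative assumptions, not measurements from the question's experiments.

```python
import numpy as np

def make_chain_mdp(n=25):
    """Hypothetical n-state chain: action 1 tends to move right, action 0 left,
    each with a 10% slip. Landing in the last state yields reward 1."""
    P = {}
    for s in range(n):
        P[s] = {}
        for a, step in ((0, -1), (1, +1)):
            hit = min(max(s + step, 0), n - 1)
            slip = min(max(s - step, 0), n - 1)
            P[s][a] = [(0.9, hit, 1.0 if hit == n - 1 else 0.0),
                       (0.1, slip, 1.0 if slip == n - 1 else 0.0)]
    return P

def backup(P, V, gamma, s, a):
    """One Bellman backup for state s and action a."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def value_iteration(P, gamma, tol=1e-6):
    """Sweep over all states and actions until the value change is below tol."""
    V, sweeps = np.zeros(len(P)), 0
    while True:
        sweeps += 1
        V_new = np.array([max(backup(P, V, gamma, s, a) for a in P[s])
                          for s in range(len(P))])
        if np.max(np.abs(V_new - V)) < tol:
            return sweeps
        V = V_new

def modified_policy_iteration(P, gamma, eval_sweeps=10, max_improvements=100):
    """Alternate a few evaluation backups under the fixed policy with a greedy
    improvement step; stop when the policy no longer changes (a simple
    stopping rule, used here for illustration)."""
    V = np.zeros(len(P))
    policy = np.zeros(len(P), dtype=int)
    sweeps = 0
    for _ in range(max_improvements):
        # Partial evaluation: cheaper sweeps, since only the policy's action
        # is backed up in each state.
        for _ in range(eval_sweeps):
            sweeps += 1
            V = np.array([backup(P, V, gamma, s, policy[s])
                          for s in range(len(P))])
        new_policy = np.array([max(P[s], key=lambda a: backup(P, V, gamma, s, a))
                               for s in range(len(P))], dtype=int)
        if np.array_equal(new_policy, policy):
            return sweeps
        policy = new_policy
    return sweeps

P = make_chain_mdp()
for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: VI sweeps={value_iteration(P, gamma)}, "
          f"MPI sweeps={modified_policy_iteration(P, gamma)}")
```

The intuition behind the numbers: value iteration needs roughly log(tol)/log(gamma) sweeps for the contraction to reach the tolerance, which blows up as gamma approaches 1, while (modified) policy iteration's cost is dominated by the number of policy improvements, which stays small and is far less sensitive to gamma.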