
Probability 0 in Importance Sampling


I have a general question about methods that use importance sampling in RL. What happens when the probability of the chosen action under either one of the policies is 0?


Solution

  • Assuming

    • b = the probability of the taken action under the behaviour policy
    • π = the probability of the same action under the target policy

    Then,

    • If π is 0 and b > 0, then the ratio π / b is 0, which simply means the return following this action from that state is weighted by zero when updating the Q table for the preceding state–action pair. In short, this is not a problem and the Monte Carlo algorithm should still converge (see the sketch at the end of this answer).
    • On the other hand, the situation where b is 0 and π > 0 should not arise in the first place if we choose a behaviour policy that has "coverage" of the target policy. If we choose a behaviour policy that lacks coverage, then we simply cannot learn accurate action-value estimates in the Q table for those (state, action) pairs that the behaviour policy never explores, and we cannot expect convergence for them.

    In the words of Sutton and Barto in their book Reinforcement Learning: An Introduction,

    In order to use episodes from b to estimate values for π, we require that 
    every action taken under π is also taken, at least occasionally, under b. 
    That is, we require that π(a|s) > 0 implies b(a|s) > 0. This is called the 
    assumption of coverage.
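
    Below is a minimal sketch of the point above, with hypothetical placeholder policies and a hand-made episode (the state/action names and the helper function are illustrative, not from any particular library). It shows how the cumulative importance-sampling ratio collapses to zero as soon as the target policy assigns probability 0 to an action the behaviour policy actually took, so that episode's return simply contributes nothing.

    ```python
    # Sketch: behaviour of the importance-sampling ratio when pi(a|s) = 0.
    # Policies are represented as dicts mapping (state, action) -> probability.

    def importance_ratio(pi, b, episode):
        """Cumulative ratio prod_t pi(a_t|s_t) / b(a_t|s_t) over one episode.

        episode: list of (state, action) pairs actually taken under b.
        """
        rho = 1.0
        for s, a in episode:
            b_prob = b[(s, a)]
            # Coverage assumption: any action pi can take must have b(a|s) > 0.
            assert b_prob > 0, "coverage violated: b(a|s) must be > 0"
            rho *= pi.get((s, a), 0.0) / b_prob
            if rho == 0.0:
                # pi gave some taken action probability 0: the whole return is
                # weighted by zero, so this episode contributes nothing.
                break
        return rho

    # Tiny illustration with two states and two actions.
    b  = {("s0", "left"): 0.5, ("s0", "right"): 0.5,
          ("s1", "left"): 0.5, ("s1", "right"): 0.5}
    pi = {("s0", "left"): 1.0, ("s0", "right"): 0.0,   # target policy never goes right in s0
          ("s1", "left"): 1.0, ("s1", "right"): 0.0}

    episode = [("s0", "right"), ("s1", "left")]        # behaviour policy happened to go right
    print(importance_ratio(pi, b, episode))            # 0.0 -> this return is ignored
    ```

    The zero ratio is harmless: in an off-policy Monte Carlo update the return is multiplied by this weight, so the Q-table entry is simply left closer to its previous value rather than being corrupted.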