python, machine-learning, reinforcement-learning, markov

How to find the γ values for which Policy Iteration yields a given optimal policy?


My teacher gave the following problem: Consider the following MDP with 3 states and rewards. There are two possible actions - RED and BLUE. The state transition probabilities are given on the edges, and S2 is a terminal state. Assume that the initial policy is: π(S0) = B; π(S1) = R.

[MDP diagram]

We were asked for which values of γ (0 < γ < 1) the optimal policy would be:

(a) π∗(S0) = R; π∗(S1) = B;

(b) π∗(S0) = B; π∗(S1) = R;

(c) π∗(S0) = R; π∗(S1) = R;

I've shown that for (a) the answer is γ = 0.1, but couldn't find γ values for (b) and (c). The teacher said that for (b) any γ > 0.98 would work, and that for (c) γ = 0.5. I think he's wrong, and have written the following Python script, which follows the algorithm in the textbook (Russell and Norvig, AIMA), and indeed for any γ value the only policy I get is (a). However, the teacher says he's not wrong and that my script must be buggy. How can I definitively show that such policies are impossible?

S0 = "S0"
S1 = "S1"
S2 = "S2"
BLUE = "blue"
RED = "red"
gamma = 0.5  # TODO MODIFY GAMMA HERE

# Transition probabilities P(s' | s, a); dictionary keys are (s', s, a)
P_destination_start_action = {
    (S0, S0, BLUE): 0.5, (S0, S0, RED): 0.9, (S0, S1, BLUE): 0.8, (S0, S1, RED): 0,   (S0, S2, BLUE): 0, (S0, S2, RED): 0,
    (S1, S0, BLUE): 0.5, (S1, S0, RED): 0,   (S1, S1, BLUE): 0.2, (S1, S1, RED): 0.6, (S1, S2, BLUE): 0, (S1, S2, RED): 0,
    (S2, S0, BLUE): 0,   (S2, S0, RED): 0.1, (S2, S1, BLUE): 0,   (S2, S1, RED): 0.4, (S2, S2, BLUE): 1, (S2, S2, RED): 1,
}

class MDP:
    def __init__(self):
        self.states = [S0, S1, S2]
        self.actions = [BLUE, RED]
        self.P_dest_start_action = P_destination_start_action
        self.rewards = {S0: -2, S1: -5, S2: 0}  # R(s): reward received in state s

def POLICY_EVALUATION(policy_vec, utility_vec, mdp):
    new_utility_vector = {}
    for s in mdp.states:
        to_sum = [(mdp.P_dest_start_action[(s_tag, s, policy_vec[s])] * utility_vec[s_tag])
                  for s_tag in mdp.states]
        new_utility_vector[s] = mdp.rewards[s] + gamma * sum(to_sum)
    return new_utility_vector

def POLICY_ITERATION(mdp):
    utility_vector = {state: 0 for state in mdp.states}
    policy_vector = {S0: BLUE, S1: RED, S2: RED}
    unchanged = False

    while not unchanged:
        utility_vector = POLICY_EVALUATION(policy_vector, utility_vector, mdp)
        unchanged = True
        for s in mdp.states:
            BLUE_sum = sum([(mdp.P_dest_start_action[(s_tag, s, BLUE)] * utility_vector[s_tag])
                            for s_tag in mdp.states])
            RED_sum = sum([(mdp.P_dest_start_action[(s_tag, s, RED)] * utility_vector[s_tag])
                           for s_tag in mdp.states])
            if policy_vector[s] == RED and BLUE_sum > RED_sum:
                policy_vector[s] = BLUE
                unchanged = False

            elif policy_vector[s] == BLUE and RED_sum > BLUE_sum:
                policy_vector[s] = RED
                unchanged = False

    return policy_vector

if __name__ == "__main__":
    Q2_mdp = MDP()
    new_policy_vec = POLICY_ITERATION(Q2_mdp)
    print("===========================END===============================")
    print("S_O policy =", new_policy_vec[S0], " ,S_1 Policy =", new_policy_vec[S1])

Solution

  • Your teacher seems to be (mostly) right.

    This isn't necessarily a problem that has to be addressed programmatically; it can also be solved mathematically (which is probably what your teacher did, and why he could say your code must be bugged without looking at it).

    Math-based solution

    Let V(S, C) denote the value of choosing color C in a state S. We trivially have V(S2, C) = 0 for all colors C.

    It is easy to write down the true values V(S0, R) and V(S1, R) for selecting the red action in S0 or S1, because they don't depend on the values of any other states (technically they do depend on the values of S2, but those are 0 so we can leave them out):

    • V(S0, R) = 0.9 * (-2 + gamma * V(S0, R))
    • V(S1, R) = 0.6 * (-5 + gamma * V(S1, R))

    With a bit of arithmetic, these can be rewritten as:

    • V(S0, R) = -1.8 / (1 - 0.9 * gamma)
    • V(S1, R) = -3 / (1 - 0.6 * gamma)
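
    For example, for the first equation: V(S0, R) = -1.8 + 0.9 * gamma * V(S0, R), so (1 - 0.9 * gamma) * V(S0, R) = -1.8, and dividing by (1 - 0.9 * gamma) gives the first expression above; the second one follows in exactly the same way.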

    It is also useful to observe that a policy that selects B (blue) in both states S0 as well as S1 will never ever be optimal. Such a policy would never reach S2 and simply keep collecting an infinite number of negative rewards.

    Knowing that, we can easily write V(S0, B) in terms of V(S0, B) and V(S1, R). We don't have to consider a V(S1, B) term in the value V(S0, B) because it would never be optimal to play B in S1 when we're considering the case where we also play B in S0 already:

    V(S0, B) = 0.5 * (-2 + gamma * V(S0, B)) + 0.5 * (-5 + gamma * V(S1, R))

    which simplifies to:

    V(S0, B) = -3.5 + 0.5 * gamma * V(S0, B) + 0.5 * gamma * (-3 / (1 - 0.6 * gamma))
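
    Solving this explicitly for V(S0, B) (collect the 0.5 * gamma * V(S0, B) term on the left-hand side and divide by 1 - 0.5 * gamma) gives:

    V(S0, B) = (-3.5 - 1.5 * gamma / (1 - 0.6 * gamma)) / (1 - 0.5 * gamma)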

    Now that we have nice expressions for V(S0, R) and V(S0, B), we can subtract one from the other: if the expression V(S0, B) - V(S0, R) is positive, the optimal policy will play B in S0. If it is negative, R will be played instead.

    With a whole lot more arithmetic, it should be possible to solve an inequality like V(S0, B) > V(S0, R) by hand. A much easier solution (albeit one that your teacher probably wouldn't like you trying on an exam) is to plug the difference of the two values (= (-3.5 + (-1.5x / (1 - 0.6x))) / (1 - 0.5x) + (1.8 / (1 - 0.9x))) into Google and see where the plot intersects the x-axis: this is at x = 0.96 (i.e. gamma = 0.96). So it appears your teacher made a small mistake, in that solution (b) actually holds for any gamma > 0.96, rather than any gamma > 0.98.
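
    If you'd rather not read the intersection off a plot, here is a minimal Python sketch (the function names are mine; it simply reuses the closed-form expressions derived above) that scans gamma and reports where the difference V(S0, B) - V(S0, R) changes sign:

    def v_s1_red(gamma):
        # V(S1, R) = -3 / (1 - 0.6 * gamma), from above
        return -3.0 / (1 - 0.6 * gamma)

    def v_s0_red(gamma):
        # V(S0, R) = -1.8 / (1 - 0.9 * gamma), from above
        return -1.8 / (1 - 0.9 * gamma)

    def v_s0_blue(gamma):
        # V(S0, B) = -3.5 + 0.5 * gamma * V(S0, B) + 0.5 * gamma * V(S1, R), solved for V(S0, B)
        return (-3.5 + 0.5 * gamma * v_s1_red(gamma)) / (1 - 0.5 * gamma)

    previous_sign = None
    for i in range(1, 1000):
        g = i / 1000.0
        sign = v_s0_blue(g) - v_s0_red(g) > 0
        if previous_sign is not None and sign != previous_sign:
            print("V(S0,B) - V(S0,R) changes sign between", (i - 1) / 1000.0, "and", g)
        previous_sign = sign

    This should report a sign change between 0.96 and 0.961, consistent with the plot.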

    Of course, the same kind of reasoning and arithmetic will work for other value functions which I did not consider yet, such as V(S1, B).


    Programming-based solution

    As for why your programming-based solution doesn't work: there does indeed appear to be a small bug. In the Policy Evaluation step, you only loop through all the states once, but multiple such sweeps in a row may be required. Note how the Russell and Norvig book indeed mentions that a modified version of Value Iteration can be used for this function, which itself keeps looping until the utilities hardly change.

    Based on the pseudocode in Sutton and Barto's Reinforcement Learning book, the Policy Evaluation function can be fixed as follows:

    def POLICY_EVALUATION(policy_vec, utility_vec, mdp):
        # Iterative policy evaluation: keep sweeping over all states until the
        # utilities barely change (cf. the pseudocode in Sutton and Barto).
        new_utility_vector = dict(utility_vec)  # work on a copy of the current utilities
        delta = 100000.0

        while delta > 0.00001:
            delta = 0.0

            for s in mdp.states:
                old_vs = new_utility_vector[s]
                to_sum = [(mdp.P_dest_start_action[(s_tag, s, policy_vec[s])] * new_utility_vector[s_tag])
                          for s_tag in mdp.states]
                new_utility_vector[s] = mdp.rewards[s] + gamma * sum(to_sum)
                delta = max(delta, abs(new_utility_vector[s] - old_vs))

        return new_utility_vector
    

    After this change, you will indeed, for example, get

    ===========================END===============================
    S_O policy = blue  ,S_1 Policy = red
    

    as output for gamma = 0.97 (not just gamma > 0.98), as expected based on the math-based solution.
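
    If you want to check where the computed optimal policy flips without editing gamma by hand for every run, a small sweep can be appended at the bottom of the script (just a sketch; it assumes the fixed POLICY_EVALUATION above and rebinds the module-level gamma that the functions read):

    # Sweep a few discount factors and print the resulting policy.
    # gamma here rebinds the module-level variable used by POLICY_EVALUATION,
    # so this loop must live at module level (not inside a function).
    for g in [0.5, 0.9, 0.95, 0.96, 0.97, 0.99]:
        gamma = g
        policy = POLICY_ITERATION(MDP())
        print("gamma =", g, "-> S0:", policy[S0], ", S1:", policy[S1])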