python deep-learning reinforcement-learning markov-decision-process mdp

Why does initialising the variable inside or outside of the loop change the code behaviour?

I am implementing policy iteration in Python for the gridworld environment as part of my learning. I have written the following code:

### POLICY ITERATION ###
def policy_iter(grid, policy):
    '''
        Perform policy iteration to find the best policy and its value
    '''
    i = 1   
    while True:
        policy_converged = True  # flag: set to False below if any action changes; used to break out of the loop
        # evaluate the value function for the current (older) policy
        old_v = value_eval(grid, policy)

        # improve the policy: pick the greedy action for each state with respect to old_v
        for s in states:
            new_a = ""
            best_v = float("-inf")
            if grid.is_terminal(s):
                continue
            old_a = policy[s]
            for a in ACTION_SPACE:
                v = 0
                for s2 in states:
                    env_prob = transition_probs.get((s,a,s2), 0)
                    reward = rewards.get((s,a,s2), 0)

                    v += env_prob * (reward + gamma*old_v[s2])
                if v > best_v:
                    new_a = a
                    best_v = v
            policy[s] = new_a
            if new_a != old_a:
                policy_converged = False
        print(i, "th iteration")
        i += 1
        if policy_converged == True:
            break

    return policy

This code works fine. However, when I move the initialisation of the policy_converged variable outside of the while loop,

def policy_iter(grid, policy):
    '''
        Perform policy iteration to find the best policy and its value
    '''
    i = 1
    policy_converged = True
    while True:

and leave the rest of the code unchanged, the program goes into an infinite loop and never stops, even though I update the flag inside the main while loop on every iteration. Why does this happen?

Solution

The loop exits (via if policy_converged == True: break) only when policy_converged is True. Inside the loop, the flag is only ever set to False. If you move the one line that sets it to True to before the loop, then as soon as any iteration sets it to False (typically the first, since the initial policy is rarely optimal), nothing can set it back to True, so the loop can never exit.
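The effect can be reproduced without any gridworld. The following is a minimal sketch, not the asker's code: the function run, its parameters, and the list of per-sweep change counts are all hypothetical, and an iteration cap stands in for the infinite loop so the example terminates.

```python
def run(reset_inside_loop, changes_per_sweep, max_sweeps=10):
    """Toy stand-in for the policy iteration loop (hypothetical helper).

    changes_per_sweep[k] is how many states changed action in sweep k + 1;
    sweeps beyond the end of the list are treated as having no changes.
    Returns the sweep on which the loop exits, or None if the cap is hit.
    """
    converged = True  # initialised once, before the loop
    for sweep in range(1, max_sweeps + 1):
        if reset_inside_loop:
            converged = True  # fresh flag for this sweep
        if sweep <= len(changes_per_sweep):
            changes = changes_per_sweep[sweep - 1]
        else:
            changes = 0
        if changes > 0:
            converged = False  # the only other assignment to the flag
        if converged:
            return sweep
    return None  # cap reached: the real code would loop forever here


print(run(True, [3, 1, 0]))   # 3
print(run(False, [3, 1, 0]))  # None
```

With the reset inside the loop, the third sweep sees no changes and exits. Without it, the False left over from the first sweep survives into every later sweep, even the ones where nothing changed.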

You should re-think your loop termination logic, and make sure there is a way for policy_converged to be set to True within the loop.
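One way to do that is to reset the flag at the top of every sweep, as in the original working version. Below is a self-contained sketch on a toy problem. It assumes a hypothetical two-state deterministic MDP (TRANSITIONS, STATES, ACTIONS, GAMMA, q_value and evaluate are all made up for illustration and are not the asker's grid, value_eval or transition_probs).

```python
GAMMA = 0.9
ACTIONS = ("left", "right")
STATES = ("s0", "s1")  # "end" is terminal and always has value 0

# Hypothetical deterministic toy MDP: (state, action) -> (next_state, reward).
TRANSITIONS = {
    ("s0", "left"): ("s0", 0.0),
    ("s0", "right"): ("s1", 0.0),
    ("s1", "left"): ("s0", 0.0),
    ("s1", "right"): ("end", 1.0),
}


def q_value(s, a, v):
    """One-step lookahead value of taking action a in state s."""
    s2, r = TRANSITIONS[(s, a)]
    return r + GAMMA * v.get(s2, 0.0)


def evaluate(policy, tol=1e-9):
    """Iterative policy evaluation, starting from all-zero values."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new_v = q_value(s, policy[s], v)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


def policy_iter(policy):
    sweeps = 0
    while True:
        policy_converged = True  # reset at the start of every sweep
        v = evaluate(policy)
        for s in STATES:
            best_a = policy[s]
            best_q = q_value(s, best_a, v)
            for a in ACTIONS:
                q = q_value(s, a, v)
                if q > best_q + 1e-12:  # switch only on strict improvement
                    best_a, best_q = a, q
            if best_a != policy[s]:
                policy[s] = best_a
                policy_converged = False
        sweeps += 1
        if policy_converged:
            return policy, sweeps


final_policy, sweeps = policy_iter({"s0": "left", "s1": "left"})
print(final_policy, sweeps)  # {'s0': 'right', 's1': 'right'} 3
```

Because the flag is recomputed from scratch on each sweep, it reflects only whether that sweep changed any action. Switching actions only on strict improvement is a design choice made here so that ties between equally good actions cannot cause the policy to flip back and forth.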