Tags: python, reinforcement-learning, gridworld

Python native gridworld implementation (no NumPy)


I've implemented the gridworld example from the book "Reinforcement Learning: An Introduction" (second edition) by Richard S. Sutton and Andrew G. Barto, Chapter 4, sections 4.1 and 4.2, page 80.

Here is my implementation: https://github.com/ozrentk/dynamic-programming-gridworld-playground

The original algorithm seems to have a bug, since the value function (mapping) is updated one value at a time, directly in the source mapping structure. Why is that incorrect? It means that inside the loop over each s in S, within the same evaluation sweep, a later element (e.g. s_2 of S) is evaluated using a value that was already updated in that sweep (e.g. the new value of s_1), instead of the value from the current iteration. I avoid this problem with a double-buffering technique: an additional buffer holds the new values for the states in S. It also means the program uses more memory because of that buffer.

I must admit I'm not 100% sure whether this is a bug or I've misinterpreted the algorithm. Roughly, this is the code I'm using:

...

while True:
  delta = 0

  # NOTE: algorithm modified a bit, additional buffer new_values introduced
  # Barto & Sutton seem to have a bug in their algorithm (the iterative estimation does not fit figure 4.1)
  # Instead of tracking one state value inside the loop, we track the entire state-value mapping
  # outside that loop. Also note that after this change the algorithm consumes more memory.
  new_values = [0] * states_n
  for s in non_terminal_states:
    # Evaluate state value under current policy
    next_states = get_next_states(s, policy[s], world_size)
    sum_items = [p * (reward + gamma * values[s_next]) for s_next, p in next_states.items()]
    new_values[s] = sum(sum_items)

    # Track delta
    delta = max(delta, abs(values[s] - new_values[s]))
  
  # (now we switch state value function buffer, instead of switching single state value in the loop)
  values = new_values

  if use_policy_improvement:
    # Policy improvement is done inside improve_policy(); if the new policy is no better than
    # the old one, it returns is_policy_stable = True
    is_policy_stable, improved_policy = improve_policy()
    if is_policy_stable:
      print("Policy is stable.")
      break
    else:
      print("- Improving policy... ----------------")
      policy = improved_policy
      visualize_policy(policy, states, world_size)

  # In case we don't do policy improvement, we need to track delta for convergence
  if delta < theta:
    break

  # Track iteration count
  k += 1

...
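
For comparison, this is roughly what the in-place update from the book's pseudocode would look like in my code, showing only the evaluation sweep without the policy-improvement part (just a sketch reusing the names above, not something I have run):

while True:
  delta = 0
  for s in non_terminal_states:
    v_old = values[s]
    # Evaluate the state value under the current policy
    next_states = get_next_states(s, policy[s], world_size)
    # In-place update: states visited later in this same sweep already see this new value
    values[s] = sum(p * (reward + gamma * values[s_next]) for s_next, p in next_states.items())
    delta = max(delta, abs(v_old - values[s]))

  if delta < theta:
    break

  k += 1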

Am I wrong, or is there a problem with the policy evaluation part of the algorithm in the book?

[Image: Iterative Policy Evaluation pseudocode from the book]


Solution

  • The original algorithm is the "asynchronous version" of policy evaluation, and your implementation using two buffers is the "synchronous version". Both are correct.

    The "asynchronous version" also converge to the optimal solution(You can find the proof in book Parallel and Distributed Computation: Numerical Methods). And as you may find in the book, it "usually converges faster".

    I find that this link provides a good explanation.
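
    To make the difference concrete, here is a small self-contained sketch (my own illustration, not code from the book or from your repository) that evaluates the equiprobable random policy on the 4x4 gridworld of Example 4.1 both ways and counts the sweeps until delta < theta:

    N = 4
    TERMINALS = {0, N * N - 1}
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def step(s, a):
        # Deterministic move; stepping off the grid leaves the agent in place
        r, c = divmod(s, N)
        nr, nc = r + a[0], c + a[1]
        if not (0 <= nr < N and 0 <= nc < N):
            nr, nc = r, c
        return nr * N + nc

    def evaluate(in_place, theta=1e-4, gamma=1.0):
        # Iterative policy evaluation under the equiprobable random policy;
        # returns the value function and the number of sweeps until delta < theta
        values = [0.0] * (N * N)
        sweeps = 0
        while True:
            delta = 0.0
            # In-place sweep reads the live list; synchronous sweep reads a frozen copy
            source = values if in_place else list(values)
            for s in range(N * N):
                if s in TERMINALS:
                    continue
                v = sum(0.25 * (-1 + gamma * source[step(s, a)]) for a in ACTIONS)
                delta = max(delta, abs(v - values[s]))
                values[s] = v
            sweeps += 1
            if delta < theta:
                return values, sweeps

    for mode in (False, True):
        v, k = evaluate(in_place=mode)
        print(f"in_place={mode}: {k} sweeps, v(1) = {v[1]:.1f}")

    Both runs end at the values shown in Figure 4.1 (e.g. v(1) = -14); the in-place sweep typically needs fewer sweeps, which matches the book's "usually converges faster" remark.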