
OpenAI GYM's env.step(): what are the values?


I am getting to know OpenAI's Gym (0.25.1) using Python 3.10, with the environment set to 'FrozenLake-v1' (code below).

According to the documentation, calling env.step() should return a tuple containing 4 values (observation, reward, done, info). However, when running my code accordingly, I get a ValueError:

Problematic code:

observation, reward, done, info = env.step(new_action)

Error:

      3 new_action = env.action_space.sample()
----> 5 observation, reward, done, info = env.step(new_action)
      7 # here's a look at what we get back
      8 print(f"observation: {observation}, reward: {reward}, done: {done}, info: {info}")

ValueError: too many values to unpack (expected 4)

Adding one more variable fixes the error:

a, b, c, d, e = env.step(new_action)
print(a, b, c, d, e)

Output:

5 0 True True {'prob': 1.0}

My interpretation:

  • 5 should be the observation
  • 0 is the reward
  • {'prob': 1.0} is info
  • One of the two True values is done

So what does the leftover boolean stand for?

Thank you for your help!


Complete code:

import gym

env = gym.make('FrozenLake-v1', new_step_api=True, render_mode='ansi') # build environment

current_obs = env.reset() # start new episode

for e in env.render():
    print(e)
    
new_action = env.action_space.sample() # random action

observation, reward, done, info = env.step(new_action) # perform action, ValueError!

for e in env.render():
    print(e)

Solution

  • From the docstring of env.step():

           Returns:
               observation (object): this will be an element of the environment's :attr:`observation_space`.
                   This may, for instance, be a numpy array containing the positions and velocities of certain objects.
               reward (float): The amount of reward returned as a result of taking the action.
               terminated (bool): whether a `terminal state` (as defined under the MDP of the task) is reached.
                   In this case further step() calls could return undefined results.
               truncated (bool): whether a truncation condition outside the scope of the MDP is satisfied.
                   Typically a timelimit, but could also be used to indicate agent physically going out of bounds.
                   Can be used to end the episode prematurely before a `terminal state` is reached.
               info (dictionary): `info` contains auxiliary diagnostic information (helpful for debugging, learning, and logging).
                   This might, for instance, contain: metrics that describe the agent's performance state, variables that are
                   hidden from observations, or individual reward terms that are combined to produce the total reward.
                   It also can contain information that distinguishes truncation and termination, however this is deprecated in favour
                   of returning two booleans, and will be removed in a future version.
               (deprecated)
               done (bool): A boolean value for if the episode has ended, in which case further :meth:`step` calls will return undefined results.
                   A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully,
                   a certain timelimit was exceeded, or the physics simulation has entered an invalid state.
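
    Given that the new step API returns five values, the fix is to unpack all five. A minimal sketch of the corrected line (the combined done flag is a common convenience pattern, not something from the original post):

        observation, reward, terminated, truncated, info = env.step(new_action)  # five values under new_step_api=True
        done = terminated or truncated  # episode is over if either flag is set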
    

    It appears that the first boolean is the terminated value, i.e. "whether a terminal state (as defined under the MDP of the task) is reached. In this case further step() calls could return undefined results."

    The second boolean is the truncated value, i.e. whether the episode was cut off before a terminal state was reached (for example by a time limit, or by the agent going out of bounds). From the docstring:

    "whether a truncation condition outside the scope of the MDP is satisfied. Typically a timelimit, but could also be used to indicate agent physically going out of bounds. Can be used to end the episode prematurely before a terminal state is reached."