Tags: python, reinforcement-learning, openai-gym

Understanding Gym Environment


This isn't specifically about troubleshooting code, but about helping me understand the gym Environment. I am inheriting from gym.Env to create my own environment, but I am having a difficult time understanding the flow. I have looked through the documentation, but there are still questions and concepts that are unclear.

  1. I am still a little foggy on how the agent actually knows which actions it can take. I know when you __init__ the class, you have to specify whether your actions are Discrete or Box, but how does the agent know what parameters are in its control?

  2. When determining the lower and upper limits for the spaces.Box command, does that tell the agent how big of a step size it can take? For example, if my limits are [-1, 1], can it choose any value within that domain?

  3. I saw that the limits can be [a, b], (-oo, a], [b, oo), or (-oo, oo). If I need my observation space to be unbounded, do I just use np.inf?

If there is any documentation that you would recommend, that would be much appreciated.


Solution

  • 1.

    The agent does not know what the action does; that is where reinforcement learning comes in. To clarify: whenever you implement the environment's step(action) method, you should verify that the action is valid within the environment and return a reward and a new environment state conditional on that action.
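
    For a concrete picture, here is a minimal sketch of such an environment using the classic gym API, where step() returns (observation, reward, done, info). The NudgeEnv name and its toy dynamics are invented for illustration:

    import gym
    import numpy as np
    from gym import spaces

    class NudgeEnv(gym.Env):
        """Toy environment: the agent nudges a scalar state toward zero."""

        def __init__(self):
            super().__init__()
            self.action_space = spaces.Discrete(2)  # 0 = step down, 1 = step up
            self.observation_space = spaces.Box(
                low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
            self.state = np.zeros(1, dtype=np.float32)

        def reset(self):
            self.state = np.random.uniform(-1.0, 1.0, size=(1,)).astype(np.float32)
            return self.state

        def step(self, action):
            # Validate the incoming action against the declared action space
            assert self.action_space.contains(action), f"invalid action {action}"
            delta = 0.1 if action == 1 else -0.1
            self.state = np.clip(self.state + delta, -1.0, 1.0).astype(np.float32)
            reward = -float(abs(self.state[0]))  # closer to zero is better
            done = bool(abs(self.state[0]) < 0.05)
            return self.state, reward, done, {}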

    If you want to reference these values outside of the environment, however, you can do so and control the available actions the agent can pass in like so:

    import gym
    env = gym.make('CartPole-v0')
    actions = env.action_space.n  # Number of discrete actions (2 for CartPole)
    

    Now you can create a network with an output shape of 2, using a softmax activation and taking the maximum probability to determine the agent's action.
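
    As a rough sketch (assuming the network's softmax output is already available as a numpy array of probabilities):

    import numpy as np

    # Hypothetical softmax output of a policy network for CartPole's 2 actions
    action_probs = np.array([0.3, 0.7])

    # Greedy choice: pick the action with the highest probability
    greedy_action = int(np.argmax(action_probs))  # -> 1

    # Or sample from the distribution, which aids exploration during training
    sampled_action = int(np.random.choice(len(action_probs), p=action_probs))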

    2.

    The spaces are used for internal environment validation. For example, observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32) means that the maximum value the agent will see for any variable is 1, and the minimum is -1. The bounds constrain the values themselves, not the size of the change between steps. You should also enforce these bounds inside the step() method to make sure the environment stays within them, as sketched below.

    This step is primarily important so that others who use your environment can identify at a glance what kind of network they will need in order to interface with it.
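
    One common pattern, sketched here with a made-up out-of-range raw_state value, is to clip inside step() so the returned observation always satisfies the declared space:

    import numpy as np
    from gym import spaces

    observation_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

    # Inside step(): clip the raw state so the returned observation
    # always falls within the declared bounds
    raw_state = np.array([1.3], dtype=np.float32)  # hypothetical out-of-range value
    obs = np.clip(raw_state, observation_space.low, observation_space.high)
    assert observation_space.contains(obs)  # passes: obs == [1.0]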

    3.

    Yes. Use np.inf and -np.inf as the bounds.
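
    For example (the shapes here are arbitrary):

    import numpy as np
    from gym import spaces

    # Unbounded in both directions, i.e. (-oo, oo) in every dimension
    unbounded = spaces.Box(low=-np.inf, high=np.inf, shape=(4,), dtype=np.float32)

    # One-sided bounds also work, e.g. [0, oo) for a non-negative quantity
    nonnegative = spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32)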