pytorch, vectorization, reinforcement-learning

How does one vectorize reinforcement learning environments?


I have a Python class that conforms to the OpenAI Gym environment API, but it's written in non-vectorized form, i.e., it receives one action per step and returns one reward per step. How do I vectorize the environment? I haven't been able to find a clear explanation on GitHub.


Solution

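  • You could write a custom class that iterates over an internal tuple of environments while maintaining the basic Gym API. In practice there will be some differences, because the underlying environments don't all terminate on the same timestep. It's therefore easier to fold the standard step and reset functions into a single method called step, which resets an individual environment as soon as its episode ends. Note that when done is True, the observation returned for that environment is the first observation of the new episode, not the terminal one. Here's an example: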

    class VectorEnv:
        def __init__(self, make_env_fn, n):
            self.envs = tuple(make_env_fn() for _ in range(n))
    
        # Call this only once at the beginning of training (optional):
        def seed(self, seeds):
            assert len(self.envs) == len(seeds)
            return tuple(env.seed(s) for env, s in zip(self.envs, seeds))
    
        # Call this only once at the beginning of training:
        def reset(self):
            return tuple(env.reset() for env in self.envs)
    
        # Call this on every timestep. Assumes the classic Gym API, where
        # env.step returns (observation, reward, done, info):
        def step(self, actions):
            assert len(self.envs) == len(actions)
            return_values = []
            for env, a in zip(self.envs, actions):
                observation, reward, done, info = env.step(a)
                if done:
                    # Auto-reset: the returned observation starts a new episode.
                    observation = env.reset()
                return_values.append((observation, reward, done, info))
            return tuple(return_values)
    
        # Call this at the end of training:
        def close(self):
            for env in self.envs:
                env.close()
    

    Then you can just instantiate it like this:

    import gym
    make_env_fn = lambda: gym.make('CartPole-v0')
    env = VectorEnv(make_env_fn, n=4)
    

    You'll have to do a little bookkeeping for your agent to handle the tuple of return values when you call step. This is also why I prefer to pass a function make_env_fn to __init__: it makes it easy to add wrappers like gym.wrappers.Monitor that track statistics for each environment individually and automatically.
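
    The bookkeeping mentioned above can be sketched as follows. This is a minimal illustration, not part of the original answer: CountdownEnv, vector_step, unzip_step_results and EpisodeTracker are hypothetical names, and CountdownEnv is a toy stand-in for a real Gym environment so the snippet runs without gym installed. vector_step mirrors the auto-resetting step method of VectorEnv above; its results are then transposed into one tuple per field, and per-environment episode returns are tracked by hand.

```python
class CountdownEnv:
    """Hypothetical toy environment (classic 4-tuple Gym-style API).

    Each episode lasts `length` steps and pays a reward of 1.0 per step.
    """

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, 1.0, done, {}


def vector_step(envs, actions):
    """Step every environment once, resetting any whose episode just ended."""
    results = []
    for env, action in zip(envs, actions):
        observation, reward, done, info = env.step(action)
        if done:
            observation = env.reset()
        results.append((observation, reward, done, info))
    return tuple(results)


def unzip_step_results(results):
    """Transpose ((obs, reward, done, info), ...) into four per-field tuples."""
    observations, rewards, dones, infos = zip(*results)
    return observations, rewards, dones, infos


class EpisodeTracker:
    """Accumulates reward per environment and records finished episode returns."""

    def __init__(self, n):
        self.running = [0.0] * n
        self.finished = [[] for _ in range(n)]

    def update(self, rewards, dones):
        for i, (reward, done) in enumerate(zip(rewards, dones)):
            self.running[i] += reward
            if done:
                # The environment was auto-reset, so start a fresh total.
                self.finished[i].append(self.running[i])
                self.running[i] = 0.0


# Two environments with different episode lengths, so they finish out of sync.
envs = (CountdownEnv(2), CountdownEnv(3))
for env in envs:
    env.reset()

tracker = EpisodeTracker(len(envs))
for _ in range(6):
    results = vector_step(envs, actions=(0, 0))
    observations, rewards, dones, infos = unzip_step_results(results)
    tracker.update(rewards, dones)

print(tracker.finished)
```

    The transposed tuples can then be batched however your agent expects, for example with np.stack or torch.as_tensor, assuming the observations all share the same shape.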