python, reinforcement-learning, openai-gym, stable-baselines

Stable Baselines3 throws ValueError when episode is truncated


So I'm trying to train an agent on my custom Gymnasium environment through Stable Baselines3, and it keeps crashing at seemingly random points, throwing the following ValueError:

Traceback (most recent call last):
  File "C:\Users\bo112\PycharmProjects\ecocharge\code\Simulation Env\prototype_visu.py", line 684, in <module>
    model.learn(total_timesteps=time_steps, tb_log_name=log_name)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\ppo\ppo.py", line 315, in learn
    return super().learn(
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 277, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 218, in collect_rollouts
    terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\policies.py", line 256, in obs_to_tensor
    vectorized_env = vectorized_env or is_vectorized_observation(obs_, obs_space)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 399, in is_vectorized_observation
    return is_vec_obs_func(observation, observation_space)  # type: ignore[operator]
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 266, in is_vectorized_box_observation
    raise ValueError(
ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape.

I don't understand why the observation shape/content would change, though, since the way the state gets its values never changes.

I figured out that it crashes whenever the agent 'survives' a whole episode for the first time, i.e. when truncation is used instead of termination. Is there some quirk to returning truncated and terminated that I don't know about? I can't find the error in my step function.

    def step(self, action):

        ...  # handling the action etc.

        reward = 0
        truncated = False
        terminated = False
        # Check if time is over/score too low - else reward function
        if self.n_step >= self.max_steps:
            truncated = True
            print('truncated')
        elif self.score < -1000:
            terminated = True
            # print('terminated')
        else:
            reward = self.reward_fnc_distance()

        self.score += reward
        self.d_score.append(self.score)
        self.n_step += 1

        # state: [current power, peak power, fridge 1 temp, fridge 2 temp, [...] , fridge n temp]
        self.state['current_power'] = self.d_power_sum[-1]
        self.state['peak_power'] = self.peak_power
        for i in range(self.n_fridges):
            self.state[f'fridge{i}_temp'] = self.d_fridges_temp[i][-1]
            self.state[f'fridge{i}_on'] = self.fridges[i].on

        if self.logging:
            print(f'score: {self.score}')

        if (truncated or terminated) and self.logging:
            self.save_run()

        return self.state, reward, terminated, truncated, {}

This is the general setup for training my models:

import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

hidden_layer = [64, 64, 32]
time_steps = 1_000_000
learning_rate = 0.003
log_name = f'PPO_{int(time_steps/1000)}k_lr{str(learning_rate).replace(".", "_")}'
vec_env = make_vec_env(env_id=ChargeEnv, n_envs=4)
model = PPO('MultiInputPolicy', vec_env, verbose=1, tensorboard_log='tensorboard_logs/',
            policy_kwargs={'net_arch': hidden_layer, 'activation_fn': th.nn.ReLU}, learning_rate=learning_rate,
            device=th.device("cuda" if th.cuda.is_available() else "cpu"), batch_size=128)
model.learn(total_timesteps=time_steps, tb_log_name=log_name)
model.save(f'models/{log_name}')
vec_env.close()

As mentioned above, the ValueError is thrown exactly when an episode gets truncated (and vice versa), so I'm fairly sure that's the cause.


EDIT:

Following the answer below, I found that the fix was simply to wrap all float/Box values of self.state in numpy arrays before returning them, like so:

self.state['current_power'] = np.array([self.d_power_sum[-1]], dtype='float32')
self.state['peak_power'] = np.array([self.peak_power], dtype='float32')
for i in range(self.n_fridges):
    self.state[f'fridge{i}_temp'] = np.array([self.d_fridges_temp[i][-1]], dtype='float32')
    self.state[f'fridge{i}_on'] = self.fridges[i].on

(Note: the dtype specification is not strictly necessary here; it only matters when using SubprocVecEnv from stable_baselines3.)
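
For anyone hitting the same issue later: Stable Baselines3 also ships an environment checker that flags exactly this kind of shape/dtype mismatch before training starts. A minimal sketch, assuming ChargeEnv is importable from your own module (the import path below is just an example):

from stable_baselines3.common.env_checker import check_env

# example import path; use wherever your custom env actually lives
from prototype_visu import ChargeEnv

env = ChargeEnv()
# check_env resets/steps the env once and warns or raises if the returned
# observations don't match observation_space (e.g. scalars instead of (1,) arrays)
check_env(env, warn=True)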


Solution

  • The problem is most likely in your custom environment definition (ChargeEnv). The error says the observation has an unexpected shape (), i.e. a scalar instead of a 1-D array. You should check your ChargeEnv.observation_space and the observations you return against it.

    If you want to create a custom environment, make sure to read the documentation to set it up correctly (https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/#declaration-and-initialization, https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).

    This is an example implementation of your ChargeEnv, where the observation space is defined correctly:

    import gymnasium as gym
    import numpy as np
    from gymnasium import spaces
    
    class ChargeEnv(gym.Env):
        def __init__(self, n_fridges=2):
            super().__init__()
    
            # Define observation space: every Box entry has shape (1,), not a scalar
            observation_space_dict = {
                'current_power': spaces.Box(low=0, high=100, shape=(1,), dtype=np.float32),
                'peak_power': spaces.Box(low=0, high=100, shape=(1,), dtype=np.float32)
            }
    
            for i in range(n_fridges):
                observation_space_dict[f'fridge{i}_temp'] = spaces.Box(low=-10, high=50, shape=(1,), dtype=np.float32)
                observation_space_dict[f'fridge{i}_on'] = spaces.Discrete(2)  # 0 or 1 (off or on)
    
            self.observation_space = spaces.Dict(observation_space_dict)
            # Don't forget to also define self.action_space here
    
            # Other environment-specific variables
            self.n_fridges = n_fridges
            # Initialize other variables as needed
    
        def reset(self, seed=None, options=None):
            # Gymnasium's reset takes seed/options and returns (observation, info)
            super().reset(seed=seed)
    
            # Reset environment to initial state and build the initial observation;
            # each Box value is a (1,)-shaped float32 array matching the space above
            initial_observation = {
                'current_power': np.array([50.0], dtype=np.float32),
                'peak_power': np.array([100.0], dtype=np.float32)
            }
            for i in range(self.n_fridges):
                initial_observation[f'fridge{i}_temp'] = np.array([25.0], dtype=np.float32)  # Example initial temperature
                initial_observation[f'fridge{i}_on'] = 0  # Example: Fridge initially off
    
            return initial_observation, {}
    
        def step(self, action):
            # Implement step logic (similar to your existing step function)
            # Update state variables, compute rewards, check termination/truncation conditions, etc.
            # Return observation, reward, terminated, truncated, and an info dict
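
    As for why it only crashes on truncation: when an episode is truncated, Stable Baselines3 bootstraps the value of the last state, so collect_rollouts feeds infos[idx]["terminal_observation"] through obs_to_tensor (the line in your traceback). The regular observations get copied into the VecEnv's pre-shaped buffers, but that raw terminal observation is validated strictly against the Box space, which is most likely why scalar values only blow up there. As a quick check that the fixed environment survives rollout collection, here is a sketch reusing your training setup with a deliberately short run:

    import numpy as np
    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    vec_env = make_vec_env(env_id=ChargeEnv, n_envs=4)
    obs = vec_env.reset()
    # Each Box entry is now stacked to shape (n_envs, 1) instead of being a bare float
    print(obs['current_power'].shape)  # expected: (4, 1)

    model = PPO('MultiInputPolicy', vec_env, verbose=1)
    model.learn(total_timesteps=10_000)  # long enough to hit at least one truncation if max_steps is small
    vec_env.close()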