I'm trying to train an agent on my custom Gymnasium environment through stable-baselines3, and it kept crashing seemingly at random, throwing the following ValueError:
Traceback (most recent call last):
  File "C:\Users\bo112\PycharmProjects\ecocharge\code\Simulation Env\prototype_visu.py", line 684, in <module>
    model.learn(total_timesteps=time_steps, tb_log_name=log_name)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\ppo\ppo.py", line 315, in learn
    return super().learn(
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 277, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 218, in collect_rollouts
    terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\policies.py", line 256, in obs_to_tensor
    vectorized_env = vectorized_env or is_vectorized_observation(obs_, obs_space)
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 399, in is_vectorized_observation
    return is_vec_obs_func(observation, observation_space) # type: ignore[operator]
  File "C:\Users\bo112\PycharmProjects\ecocharge\venv\lib\site-packages\stable_baselines3\common\utils.py", line 266, in is_vectorized_box_observation
    raise ValueError(
ValueError: Error: Unexpected observation shape () for Box environment, please use (1,) or (n_env, 1) for the observation shape.
I don't understand why the observation shape or content would change, since the way the state gets its values never changes. I did figure out that it crashes whenever the agent 'survives' a whole episode for the first time, i.e. when truncation is used instead of termination. Is there some weird quirk to returning truncated and terminated that I don't know about? I can't find the error in my step function:
def step(self, action):
    ...  # handling the action etc.
    reward = 0
    truncated = False
    terminated = False
    # Check if time is over/score too low - else reward function
    if self.n_step >= self.max_steps:
        truncated = True
        print('truncated')
    elif self.score < -1000:
        terminated = True
        # print('terminated')
    else:
        reward = self.reward_fnc_distance()
    self.score += reward
    self.d_score.append(self.score)
    self.n_step += 1
    # state: [current power, peak power, fridge 1 temp, fridge 2 temp, [...] , fridge n temp]
    self.state['current_power'] = self.d_power_sum[-1]
    self.state['peak_power'] = self.peak_power
    for i in range(self.n_fridges):
        self.state[f'fridge{i}_temp'] = self.d_fridges_temp[i][-1]
        self.state[f'fridge{i}_on'] = self.fridges[i].on
    if self.logging:
        print(f'score: {self.score}')
    if (truncated or terminated) and self.logging:
        self.save_run()
    return self.state, reward, terminated, truncated, {}
This is the general setup for training my models:
import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

hidden_layer = [64, 64, 32]
time_steps = 1_000_000
learning_rate = 0.003
log_name = f'PPO_{int(time_steps/1000)}k_lr{str(learning_rate).replace(".", "_")}'

vec_env = make_vec_env(env_id=ChargeEnv, n_envs=4)
model = PPO('MultiInputPolicy', vec_env, verbose=1, tensorboard_log='tensorboard_logs/',
            policy_kwargs={'net_arch': hidden_layer, 'activation_fn': th.nn.ReLU}, learning_rate=learning_rate,
            device=th.device("cuda" if th.cuda.is_available() else "cpu"), batch_size=128)
model.learn(total_timesteps=time_steps, tb_log_name=log_name)
model.save(f'models/{log_name}')
vec_env.close()
As mentioned above, episodes only get truncated when the ValueError is thrown, and vice versa, so I'm fairly sure that has to be the cause.
EDIT:
From the answer below, I found that the fix was simply to wrap all the float/Box values of self.state in numpy arrays before returning them, like the following:
self.state['current_power'] = np.array([self.d_power_sum[-1]], dtype='float32')
self.state['peak_power'] = np.array([self.peak_power], dtype='float32')
for i in range(self.n_fridges):
    self.state[f'fridge{i}_temp'] = np.array([self.d_fridges_temp[i][-1]], dtype='float32')
    self.state[f'fridge{i}_on'] = self.fridges[i].on
(Note: the dtype specification is not strictly necessary for the fix itself; it only matters when using SubprocVecEnv from stable_baselines3.)
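A quick way to catch this kind of mismatch up front is stable_baselines3's built-in environment checker. A minimal sketch (assuming ChargeEnv is importable here):

from stable_baselines3.common.env_checker import check_env

# Warns/raises if observations returned by reset()/step() don't match
# observation_space, e.g. a bare Python float where a (1,)-shaped Box is declared.
check_env(ChargeEnv(), warn=True)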
The problem is most likely in your custom environment definition (ChargeEnv). The error says the observation has the wrong shape: it is a scalar with shape () where a (1,)-shaped array is expected. You should check your ChargeEnv.observation_space and make sure the observations you return actually match it. (The crash only appears on truncation because SB3 only feeds the terminal observation from the info dict through obs_to_tensor when an episode is truncated, in order to bootstrap the value estimate.)
If you want to create a custom environment, make sure to read the documentation to set it up correctly (https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/#declaration-and-initialization, https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
This is an example implementation of your ChargeEnv where the observation space is defined correctly:
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class ChargeEnv(gym.Env):
    def __init__(self, n_fridges=2):
        super().__init__()
        # Define observation space
        observation_space_dict = {
            'current_power': spaces.Box(low=0, high=100, shape=(1,), dtype=np.float32),
            'peak_power': spaces.Box(low=0, high=100, shape=(1,), dtype=np.float32)
        }
        for i in range(n_fridges):
            observation_space_dict[f'fridge{i}_temp'] = spaces.Box(low=-10, high=50, shape=(1,), dtype=np.float32)
            observation_space_dict[f'fridge{i}_on'] = spaces.Discrete(2)  # 0 or 1 (off or on)
        self.observation_space = spaces.Dict(observation_space_dict)
        # Also define self.action_space here to match how step() interprets actions

        # Other environment-specific variables
        self.n_fridges = n_fridges
        # Initialize other variables as needed

    def reset(self, seed=None, options=None):
        # Gymnasium's reset() takes seed/options and returns (observation, info)
        super().reset(seed=seed)
        # Reset environment to initial state
        # Initialize state variables, e.g., current_power, peak_power, fridge temperatures, etc.
        initial_observation = {
            'current_power': np.array([50.0], dtype=np.float32),
            'peak_power': np.array([100.0], dtype=np.float32)
        }
        for i in range(self.n_fridges):
            initial_observation[f'fridge{i}_temp'] = np.array([25.0], dtype=np.float32)  # Example initial temperature
            initial_observation[f'fridge{i}_on'] = 0  # Example: fridge initially off
        return initial_observation, {}

    def step(self, action):
        # Implement step logic (similar to your existing step function)
        # Update state variables, compute the reward, check termination/truncation conditions, etc.
        # Return observation, reward, terminated, truncated, and an info dict
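Once step() is implemented, one way to sanity-check an environment like this (a sketch, assuming the ChargeEnv above) is to confirm the returned observation actually lies inside the declared space, and then run SB3's environment checker:

env = ChargeEnv(n_fridges=2)

# Every Box entry is a (1,)-shaped float32 array rather than a bare Python float,
# so the initial observation should be contained in the declared Dict space.
obs, info = env.reset()
assert env.observation_space.contains(obs)

# With step() in place, SB3's checker also validates the action space and the
# (obs, reward, terminated, truncated, info) return signature:
# from stable_baselines3.common.env_checker import check_env
# check_env(env, warn=True)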