I am training a reinforcement learning agent with OpenAI's Stable Baselines, and I'm optimising the agent's hyperparameters using Optuna. To speed up the process I use multiprocessing in two places: in SubprocVecEnv (parallel environments) and in study.optimize (parallel trials), as suggested in the docs here (under 1.15.3 and 1.10.4 respectively).
import gym
import numpy as np
import optuna

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv

n_cpu = 4


def optimize_ppo2(trial):
    """Learning hyperparameters we want to optimise."""
    return {
        'n_steps': int(trial.suggest_loguniform('n_steps', 16, 2048)),
        'gamma': trial.suggest_loguniform('gamma', 0.9, 0.9999),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1.),
        'ent_coef': trial.suggest_loguniform('ent_coef', 1e-8, 1e-1),
        'cliprange': trial.suggest_uniform('cliprange', 0.1, 0.4),
        'noptepochs': int(trial.suggest_loguniform('noptepochs', 1, 48)),
        'lam': trial.suggest_uniform('lam', 0.8, 1.)
    }


def optimize_agent(trial):
    """Train the model and evaluate it.

    Optuna minimises the objective by default, so we negate the
    mean episode reward before reporting/returning it.
    """
    model_params = optimize_ppo2(trial)
    env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(n_cpu)])
    model = PPO2(MlpLnLstmPolicy, env, verbose=0, nminibatches=1, **model_params)
    model.learn(10000)

    # The vectorised env returns one reward/done flag per worker,
    # so track episodes of the first worker only.
    rewards = []
    n_episodes, reward_sum = 0, 0.0
    obs = env.reset()
    while n_episodes < 4:
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        reward_sum += reward[0]
        if done[0]:
            rewards.append(reward_sum)
            reward_sum = 0.0
            n_episodes += 1
            obs = env.reset()

    last_reward = np.mean(rewards)
    trial.report(-1 * last_reward)
    env.close()  # shut down the environment subprocesses before the next trial

    return -1 * last_reward


if __name__ == '__main__':
    study = optuna.create_study(study_name='cartpol_optuna',
                                storage='sqlite:///params.db',
                                load_if_exists=True)
    study.optimize(optimize_agent, n_trials=1000, n_jobs=4)
I am using a GPU in the Google Colab environment. My question is: with multiprocessing used in both SubprocVecEnv and study.optimize, how can I be sure that the hyperparameter tuning is being executed correctly in the backend? In other words, how do I know that trial results aren't being overwritten?
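For what it's worth, I can reload the study from the SQLite storage and inspect the recorded trials with something like the snippet below, but that only shows me the stored rows, not whether parallel workers interfered with each other:

import optuna

# Reload the study from the same SQLite storage the workers write to
study = optuna.load_study(study_name='cartpol_optuna', storage='sqlite:///params.db')

# One row per trial: state, objective value (the negated reward) and sampled parameters
print(study.trials_dataframe())
print('Best value so far:', study.best_value)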
In addition, is there a better way to use GPU multiprocessing in this particular use case, where both SubprocVecEnv and study.optimize can run on multiple cores? (I'm unsure whether spawning too many threads on the same processor will actually slow things down by creating more overhead than running with fewer threads.)
I guess your code has the same issue as the one reported here. The stable-baselines library uses TensorFlow as its deep learning framework, and this can lead to a TensorFlow session being unintentionally shared among multiple trials: the trials try to update a single computational graph simultaneously and end up corrupting it.

I think you can parallelise trials if you modify your code so that each trial uses its own separate session. Alternatively, you can simply remove the n_jobs option from study.optimize.
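Here is a minimal sketch of the first option, under the assumption that paying for one extra process per trial is acceptable (the helper train_and_evaluate and its exact structure are my own, not part of stable-baselines or Optuna): each trial trains in its own subprocess, so it gets a fresh TensorFlow graph and session, and only the final score is sent back to the parent.

import multiprocessing as mp

def train_and_evaluate(model_params, result_queue):
    # Importing inside the child keeps all TensorFlow state local to this process.
    import gym
    import numpy as np
    from stable_baselines import PPO2
    from stable_baselines.common.policies import MlpLnLstmPolicy
    from stable_baselines.common.vec_env import SubprocVecEnv

    env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])
    model = PPO2(MlpLnLstmPolicy, env, verbose=0, nminibatches=1, **model_params)
    model.learn(10000)

    # Evaluate a few episodes of the first worker, as in the question.
    rewards, reward_sum = [], 0.0
    obs = env.reset()
    while len(rewards) < 4:
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        reward_sum += reward[0]
        if done[0]:
            rewards.append(reward_sum)
            reward_sum = 0.0
    env.close()
    result_queue.put(float(np.mean(rewards)))

def optimize_agent(trial):
    model_params = optimize_ppo2(trial)
    result_queue = mp.Queue()
    worker = mp.Process(target=train_and_evaluate, args=(model_params, result_queue))
    worker.start()
    mean_reward = result_queue.get()  # blocks until the child reports its score
    worker.join()
    return -mean_reward  # the study minimises, so negate the reward

With this isolation in place, study.optimize(optimize_agent, n_trials=1000, n_jobs=4) only coordinates the trials, while all TensorFlow work happens in short-lived child processes. The simpler alternative is to drop n_jobs entirely so that trials run one at a time; SubprocVecEnv still parallelises the environment stepping inside each trial.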