I am training a reinforcement learning agent with OpenAI's Stable Baselines, and I'm optimising the agent's hyperparameters using Optuna. To speed up the process I use multiprocessing in two places: in SubprocVecEnv (parallel environments) and in study.optimize (parallel trials), as suggested in the docs here (under 1.15.3 and 1.10.4 respectively).
import gym
import numpy as np
import optuna

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv

n_cpu = 4


def optimize_ppo2(trial):
    """Learning hyperparameters we want to optimise."""
    return {
        'n_steps': int(trial.suggest_loguniform('n_steps', 16, 2048)),
        'gamma': trial.suggest_loguniform('gamma', 0.9, 0.9999),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-5, 1.),
        'ent_coef': trial.suggest_loguniform('ent_coef', 1e-8, 1e-1),
        'cliprange': trial.suggest_uniform('cliprange', 0.1, 0.4),
        'noptepochs': int(trial.suggest_loguniform('noptepochs', 1, 48)),
        'lam': trial.suggest_uniform('lam', 0.8, 1.)
    }


def optimize_agent(trial):
    """Train the model and evaluate it.

    Optuna minimises the objective by default, so we negate the
    mean episode reward before reporting/returning it.
    """
    model_params = optimize_ppo2(trial)
    env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(n_cpu)])
    model = PPO2(MlpLnLstmPolicy, env, verbose=0, nminibatches=1, **model_params)
    model.learn(10000)

    # The vectorised env returns one reward/done flag per worker,
    # so track episodes of the first worker only.
    rewards = []
    n_episodes, reward_sum = 0, 0.0
    obs = env.reset()
    while n_episodes < 4:
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        reward_sum += reward[0]
        if done[0]:
            rewards.append(reward_sum)
            reward_sum = 0.0
            n_episodes += 1
            obs = env.reset()

    last_reward = np.mean(rewards)
    trial.report(-1 * last_reward)
    env.close()  # shut down the environment subprocesses before the next trial

    return -1 * last_reward


if __name__ == '__main__':
    study = optuna.create_study(study_name='cartpol_optuna',
                                storage='sqlite:///params.db',
                                load_if_exists=True)
    study.optimize(optimize_agent, n_trials=1000, n_jobs=4)
I am using a GPU in the Google Colab environment. My question is: with multiprocessing used in both SubprocVecEnv and study.optimize, how can I be sure that the hyperparameter tuning is being executed correctly in the backend? In other words, how do I know that trial results aren't being overwritten?
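For what it's worth, I can reload the study from the SQLite storage and inspect the recorded trials with something like the snippet below, but that only shows me the stored rows, not whether parallel workers interfered with each other:

import optuna

# Reload the study from the same SQLite storage the workers write to
study = optuna.load_study(study_name='cartpol_optuna', storage='sqlite:///params.db')

# One row per trial: state, objective value (the negated reward) and sampled parameters
print(study.trials_dataframe())
print('Best value so far:', study.best_value)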
In addition, is there a better way to use GPU multiprocessing in this particular use case, where both SubprocVecEnv and study.optimize can run on multiple cores? (I'm unsure whether spawning too many threads on the same processor will actually slow things down by creating more overhead than running with fewer threads.)
I guess your code has the same issue as the one reported here. The stable-baselines library uses TensorFlow as its deep learning framework, and this can lead to a TensorFlow session being unintentionally shared among multiple trials: the trials try to update a single computational graph simultaneously and end up corrupting it.

I think you can parallelise trials if you modify your code so that each trial uses its own separate session. Alternatively, you can simply remove the n_jobs option from study.optimize.
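Here is a minimal sketch of the first option, under the assumption that paying for one extra process per trial is acceptable (the helper train_and_evaluate and its exact structure are my own, not part of stable-baselines or Optuna): each trial trains in its own subprocess, so it gets a fresh TensorFlow graph and session, and only the final score is sent back to the parent.

import multiprocessing as mp

def train_and_evaluate(model_params, result_queue):
    # Importing inside the child keeps all TensorFlow state local to this process.
    import gym
    import numpy as np
    from stable_baselines import PPO2
    from stable_baselines.common.policies import MlpLnLstmPolicy
    from stable_baselines.common.vec_env import SubprocVecEnv

    env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])
    model = PPO2(MlpLnLstmPolicy, env, verbose=0, nminibatches=1, **model_params)
    model.learn(10000)

    # Evaluate a few episodes of the first worker, as in the question.
    rewards, reward_sum = [], 0.0
    obs = env.reset()
    while len(rewards) < 4:
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        reward_sum += reward[0]
        if done[0]:
            rewards.append(reward_sum)
            reward_sum = 0.0
    env.close()
    result_queue.put(float(np.mean(rewards)))

def optimize_agent(trial):
    model_params = optimize_ppo2(trial)
    result_queue = mp.Queue()
    worker = mp.Process(target=train_and_evaluate, args=(model_params, result_queue))
    worker.start()
    mean_reward = result_queue.get()  # blocks until the child reports its score
    worker.join()
    return -mean_reward  # the study minimises, so negate the reward

With this isolation in place, study.optimize(optimize_agent, n_trials=1000, n_jobs=4) only coordinates the trials, while all TensorFlow work happens in short-lived child processes. The simpler alternative is to drop n_jobs entirely so that trials run one at a time; SubprocVecEnv still parallelises the environment stepping inside each trial.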