I'm currently using code from OpenAI baselines to train a model, with the following code in my train.py:
from baselines.common import tf_util as U
import tensorflow as tf
import gym, logging
from visak_dartdeepmimic import VisakDartDeepMimicArgParse

def train(env, initial_params_path,
          save_interval, out_prefix, num_timesteps, num_cpus):
    from baselines.ppo1 import mlp_policy, pposgd_simple
    sess = U.make_session(num_cpu=num_cpus).__enter__()
    U.initialize()

    def policy_fn(name, ob_space, ac_space):
        print("Policy with name: ", name)
        policy = mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
                                      hid_size=64, num_hid_layers=2)
        saver = tf.train.Saver()
        if initial_params_path is not None:
            print("Tried to restore from ", initial_params_path)
            saver.restore(tf.get_default_session(), initial_params_path)
        return policy

    def callback_fn(local_vars, global_vars):
        iters = local_vars["iters_so_far"]
        saver = tf.train.Saver()
        if iters % save_interval == 0:
            saver.save(sess, out_prefix + str(iters))

    pposgd_simple.learn(env, policy_fn,
                        max_timesteps=num_timesteps,
                        callback=callback_fn,
                        timesteps_per_actorbatch=2048,
                        clip_param=0.2, entcoeff=0.0,
                        optim_epochs=10, optim_stepsize=3e-4, optim_batchsize=64,
                        gamma=1.0, lam=0.95, schedule='linear',
                        )
    env.close()
This is based on the code that OpenAI itself provides in the baselines repository.
This works fine, except that I get some pretty weird-looking learning curves, which I suspect are due to some hyperparameters passed to the learn function that cause performance to decay or variance to grow as training goes on (though I don't know for certain).
Anyway, to test this hypothesis I'd like to retrain the model, but not from scratch: I'd like to start it off from a high point, say iteration 1600, for which I have a saved model lying around (saved with saver.save in callback_fn).
So now I call the train function, but this time I provide it with an initial_params_path pointing to the save prefix for iteration 1600. By my understanding, the call to saver.restore in policy_fn should reset the model to where it was at iteration 1600 (and I've confirmed that the load routine runs using the print statement).
However, in practice I find that it's almost like nothing gets loaded. For instance, if I got statistics like
----------------------------------
| EpLenMean | 74.2 |
| EpRewMean | 38.7 |
| EpThisIter | 209 |
| EpisodesSoFar | 662438 |
| TimeElapsed | 2.15e+04 |
| TimestepsSoFar | 26230266 |
| ev_tdlam_before | 0.95 |
| loss_ent | 2.7640965 |
| loss_kl | 0.09064759 |
| loss_pol_entpen | 0.0 |
| loss_pol_surr | -0.048767302 |
| loss_vf_loss | 3.8620138 |
----------------------------------
for iteration 1600, then for iteration 1 of the new trial (ostensibly using 1600's parameters as a starting point), I get something like
----------------------------------
| EpLenMean | 2.12 |
| EpRewMean | 0.486 |
| EpThisIter | 7676 |
| EpisodesSoFar | 7676 |
| TimeElapsed | 12.3 |
| TimestepsSoFar | 16381 |
| ev_tdlam_before | -4.47 |
| loss_ent | 45.355236 |
| loss_kl | 0.016298374 |
| loss_pol_entpen | 0.0 |
| loss_pol_surr | -0.039200217 |
| loss_vf_loss | 0.043219414 |
----------------------------------
which is back to square one (this is around where my models trained from scratch start).
The funny thing is that I know the model is at least being saved properly, since I can actually replay it using eval.py:
from baselines.common import tf_util as U
from baselines.ppo1 import mlp_policy, pposgd_simple
import numpy as np
import tensorflow as tf

class PolicyLoaderAgent(object):
    """The world's simplest agent!"""
    def __init__(self, param_path, obs_space, action_space):
        self.action_space = action_space
        self.actor = mlp_policy.MlpPolicy("pi", obs_space, action_space,
                                          hid_size = 64, num_hid_layers=2)
        U.initialize()
        saver = tf.train.Saver()
        saver.restore(tf.get_default_session(), param_path)

    def act(self, observation, reward, done):
        action2, unknown = self.actor.act(False, observation)
        return action2

if __name__ == "__main__":
    parser = VisakDartDeepMimicArgParse()
    parser.add_argument("--params-prefix", required=True, type=str)
    args = parser.parse_args()
    env = parser.get_env()
    U.make_session(num_cpu=1).__enter__()
    U.initialize()
    agent = PolicyLoaderAgent(args.params_prefix, env.observation_space, env.action_space)
    while True:
        ob = env.reset(0, pos_stdv=0, vel_stdv=0)
        done = False
        while not done:
            action = agent.act(ob, reward, done)
            ob, reward, done, _ = env.step(action)
            env.render()
and I can clearly see that it's learned something compared to an untrained baseline. The loading code is the same across both files (or rather, if there's a mistake there, I can't find it), so it seems probable to me that train.py is loading the model correctly and then, due to something in the pposgd_simple.learn function, promptly forgetting about it.
Could anyone shed some light on this situation?
Not sure if this is still relevant, since the baselines repository has changed quite a bit since this question was posted, but it seems that you are not actually initialising the variables before restoring them. Try moving the call to U.initialize() inside your policy_fn:
def policy_fn(name, ob_space, ac_space):
    print("Policy with name: ", name)
    policy = mlp_policy.MlpPolicy(name=name, ob_space=ob_space,
                                  ac_space=ac_space, hid_size=64, num_hid_layers=2)
    saver = tf.train.Saver()
    if initial_params_path is not None:
        print("Tried to restore from ", initial_params_path)
        U.initialize()
        saver.restore(tf.get_default_session(), initial_params_path)
    return policy
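
To illustrate why the order might matter, here is a toy model in plain Python (no TensorFlow). It rests on an assumption about baselines that I have not re-checked against the current repository: that U.initialize() only initializes variables it has not handled before, and that pposgd_simple.learn calls it again after building the policies. Every name in the sketch (create_variable, initialize, restore, reset, the "pi/w" key) is hypothetical and only stands in for the real machinery.

```python
# Toy model (plain Python, no TensorFlow). ASSUMPTION: the real initializer
# skips variables it has already handled. All names here are hypothetical.
import random

variables = {}               # name -> value; stands in for the session's variables
already_initialized = set()  # names the initializer has already handled


def create_variable(name):
    """Register a variable without giving it a value yet."""
    variables.setdefault(name, None)


def initialize():
    """Assign fresh random values, but only to variables not seen before."""
    for name in variables:
        if name not in already_initialized:
            variables[name] = random.random()  # always in [0, 1)
            already_initialized.add(name)


def restore(checkpoint):
    """Overwrite variables with the values stored in a checkpoint."""
    variables.update(checkpoint)


def reset():
    """Start over with an empty variable store."""
    variables.clear()
    already_initialized.clear()


checkpoint = {"pi/w": 1600.0}

# Order in the question: restore first, the initializer runs later (inside learn).
reset()
create_variable("pi/w")
restore(checkpoint)
initialize()   # "pi/w" still looks new to the initializer, so it is overwritten
restore_then_init = variables["pi/w"]

# Order suggested above: initialize first, then restore.
reset()
create_variable("pi/w")
initialize()
restore(checkpoint)
initialize()   # nothing new to initialize, so the restored value survives
init_then_restore = variables["pi/w"]

print(restore_then_init == 1600.0, init_then_restore == 1600.0)
```

This prints False True. If the assumption holds, restoring before the first initialization leaves the variables looking "new", so the later initialization inside learn replaces the loaded weights, which would match the from-scratch statistics in the question. It would also explain why eval.py works: there, U.initialize() runs before saver.restore.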