
Two models (with similar hyperparameters) loaded from the same checkpoint giving different training results in PyTorch


I trained a model (LeNet-5) for 10 epochs and saved it, then loaded the checkpoint into two models, ‘new_model’ and ‘new_model2’. The full code is in this Colab notebook: https://colab.research.google.com/drive/1qQhyTWNzCgMYn8t0ZtIZilLgk4JptbJG?usp=sharing
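For reference, a minimal sketch of the loading step; the LeNet5 definition and the checkpoint filename here are stand-ins for the actual code in the Colab:

    import torch
    import torch.nn as nn

    class LeNet5(nn.Module):
        # Stand-in LeNet-5 for 32x32 single-channel input; the real
        # architecture is defined in the linked notebook.
        def __init__(self, n_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 6, 5), nn.Tanh(), nn.AvgPool2d(2),
                nn.Conv2d(6, 16, 5), nn.Tanh(), nn.AvgPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
                nn.Linear(120, 84), nn.Tanh(),
                nn.Linear(84, n_classes),
            )

        def forward(self, x):
            return self.classifier(self.features(x))

    # Hypothetical filename; the real checkpoint is produced in the Colab.
    state = torch.load('lenet5_10epochs.pt')

    # Load the same saved weights into two fresh instances.
    new_model = LeNet5()
    new_model.load_state_dict(state)
    new_model2 = LeNet5()
    new_model2.load_state_dict(state)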

I trained the new models for 5 epochs, but ended up with different train and test accuracies for each epoch, despite loading both from the same checkpoint and applying reproducibility settings.

When I continue training the original model for 5 more epochs, its results also differ from the training results of the 2 new models.

Is it possible for the test and train accuracies of the original model (15 epochs) and the 2 new models (5 epochs after loading from the checkpoint) to be the same?

(Right after loading the checkpoint I get the same test accuracy for all 3 models, but the results diverge on further training of each model.)


Solution

  • You should reset all the seeds to the same fixed value right before launching each experiment. In short, this should be the order:

    1. Set the seed.
    2. Train new model #1.
    3. Set the seed again, to the same value.
    4. Train new model #2.

    Reusing some of your code, we can define a function to set the seed; call it with the same value in steps 1 and 3 (see the usage sketch after the definition):

    import os
    import random

    import numpy as np
    import torch as th

    def set_seed(s):
        th.manual_seed(s)                       # seed the CPU RNG
        th.cuda.manual_seed_all(s)              # seed the RNG on every GPU
        th.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
        th.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
        np.random.seed(s)                       # seed NumPy
        random.seed(s)                          # seed Python's random module
        os.environ['PYTHONHASHSEED'] = str(s)   # fix Python hash randomization
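    With that helper, the four steps above look like this (train here is a hypothetical stand-in for the training loop in your notebook, and 42 is an arbitrary fixed value):

    SEED = 42  # any fixed value works, as long as both runs use the same one

    set_seed(SEED)     # step 1: reset every RNG
    train(new_model)   # step 2: train new model #1

    set_seed(SEED)     # step 3: reset every RNG to the same value again
    train(new_model2)  # step 4: train new model #2

    Re-seeding between the runs matters because the first run advances the RNG state (data shuffling, dropout, any random augmentation), so without step 3 the second model would consume a different random stream and diverge, exactly as you observed.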