deep-learning, neural-network, pytorch

Is it possible to resume execution from the point where neural network training was interrupted?


Assume that I am training a neural network model and saving a checkpoint of the model in .pth format every 15 epochs.

I need to run 1000 epochs in total. Suppose I stopped my program during the 501st epoch; then I have the following files:

15.pth, 30.pth, 45.pth, 60.pth, 75.pth,.... 420.pth, 435.pth, 450.pth, 465.pth, 480.pth, 495.pth

My question is:

Is it possible to load the last stored model, 495.pth, and continue training as if it had never been interrupted? In short, I am asking whether the training phase can be "resumed" with a few modifications to the existing code. I just want to know whether such a possibility exists.

I am asking about general practice, not about any particular code. If such a method exists, I would be free to stop any program under execution and resume it later. Currently, I cannot free up resources for shorter programs while longer programs are running, which is why I am asking this question.


Solution

  • In order to resume training from a checkpoint, you need to save the entire state of your training process. This includes:

    1. Current weights of the model.
    2. State of the optimizer: most optimizers keep track of statistics of the updates, e.g., momentum, variance, etc.
    3. State of the learning rate scheduler.
    4. Additional "state" variables unique to your code.

    If you saved all this information, you should be able to fully restore the "state" of your training process and resume from that point.
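    Here is a minimal sketch of what that looks like in PyTorch. The model, optimizer, scheduler, and data are placeholders; only the 15-epoch interval, the 1000-epoch budget, and the `495.pth` filename come from the question.

    ```python
    import os
    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Placeholder model and data; substitute your own architecture and dataset.
    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
    loss_fn = nn.MSELoss()
    x, y = torch.randn(64, 10), torch.randn(64, 1)

    def save_checkpoint(epoch, path):
        # Bundle everything needed to resume: weights, optimizer statistics,
        # scheduler state, and the epoch counter itself.
        torch.save({
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "scheduler_state": scheduler.state_dict(),
        }, path)

    def load_checkpoint(path):
        # Restore every piece of state so training continues where it stopped.
        checkpoint = torch.load(path)
        model.load_state_dict(checkpoint["model_state"])
        optimizer.load_state_dict(checkpoint["optimizer_state"])
        scheduler.load_state_dict(checkpoint["scheduler_state"])
        return checkpoint["epoch"]

    # Resume from the most recent checkpoint if it exists, e.g. 495.pth.
    start_epoch = load_checkpoint("495.pth") + 1 if os.path.exists("495.pth") else 1

    for epoch in range(start_epoch, 1001):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()
        if epoch % 15 == 0:  # matches the 15-epoch interval from the question
            save_checkpoint(epoch, f"{epoch}.pth")
    ```

    Because `load_state_dict` on the optimizer and scheduler restores the momentum buffers and the learning-rate schedule along with the weights, epoch 496 should behave the same as it would have in an uninterrupted run.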