Assume that I am training a neural network model and saving the model's tensors to a .pth file every 15 epochs.
I need to run 1000 epochs in total. Suppose I stop my program during the 501st epoch. I then have the following files:
15.pth, 30.pth, 45.pth, 60.pth, 75.pth, ..., 420.pth, 435.pth, 450.pth, 465.pth, 480.pth, 495.pth
My question is: is it possible to load the last stored model, 495.pth, and continue training as if there had been no interruption? In short, I am asking whether the training phase can be "resumed" with a few modifications to the existing code. I only want to know whether this is possible.
I am asking about general practice, not about any particular code. If such a method exists, I would be free to stop any running program and resume it later. Currently, I cannot use the resources for shorter programs while longer programs are running, which is why I am asking this question.
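In order to resume training from a checkpoint, you need to save the entire state of your training process, not just the model tensors. This typically includes the model weights, the optimizer state, the number of the last completed epoch, and the state of any other stateful component such as a learning-rate scheduler.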
If you saved all this information, you should be able to fully restore the "state" of your training process and resume from that point.
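As a rough illustration, here is a minimal PyTorch sketch of saving and restoring such a state. The helper names (`save_checkpoint`, `load_checkpoint`, `train`), the dictionary keys, and the tiny linear model are hypothetical and only for demonstration; your own model, optimizer, and data loading will differ. It also assumes there is no learning-rate scheduler or random data shuffling; if you have those, their state would need to be saved too.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch):
    # Store everything needed to continue: weights, optimizer state, epoch.
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    # Restore the saved state in place and return the next epoch to run.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1


def train(ckpt_dir, total_epochs, save_every, resume_from=None):
    # Toy setup (an assumption for this sketch): fixed data, full-batch updates.
    torch.manual_seed(0)
    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    x = torch.randn(32, 4)
    y = x.sum(dim=1, keepdim=True)

    start_epoch = 1
    if resume_from is not None:
        start_epoch = load_checkpoint(resume_from, model, optimizer)

    for epoch in range(start_epoch, total_epochs + 1):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        if epoch % save_every == 0:
            # Produces 15.pth, 30.pth, ... as in the question.
            save_checkpoint(os.path.join(ckpt_dir, f"{epoch}.pth"), model, optimizer, epoch)
    return model


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()
    train(ckpt_dir, total_epochs=20, save_every=15)  # "interrupted" run
    train(ckpt_dir, total_epochs=30, save_every=15,
          resume_from=os.path.join(ckpt_dir, "15.pth"))  # resumes at epoch 16
```

In the scenario from the question, you would pass the path to 495.pth as `resume_from` and the loop would start again at epoch 496. Note that this only works if the .pth files contain the optimizer state and epoch as well; files that hold only the model weights let you restore the model but not the exact optimizer state.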