I'm currently working on GANs (honestly, I'm not yet fully comfortable with CNNs and RNNs either). Since my computing environment is a personal one, I'm using the paid version of Google Colab. As I understand it, a Colab session lets you use the runtime for up to about 24 hours (not exactly 24 hours).
Because of this, training takes a long time and gets cut off partway through, so until now I've been limited to datasets and epoch counts that can finish within 24 hours. Then it occurred to me: couldn't I accumulate training across sessions?
So my question is: if, for example, I was training for a maximum of 100 epochs and the session was cut off after only 50, can I resume from epoch 50 in the next session? If that's possible, then even with the 24-hour limit, I could keep training for long periods by continuing the run every 24 hours. That's my thinking.
Is this possible?
In frameworks like PyTorch or TensorFlow this is pretty simple. You can save your model's weights, and to restore them later all you have to do is create an instance of your model and load the weights.
For PyTorch, you basically do this:
torch.save(model.state_dict(), path_to_save_to)
When you want to load saved weights:
model = ModelClass()
model.load_state_dict(torch.load(path_saved_to))
You may want to save after every epoch, after every n epochs, or only when your model's performance improves, etc.
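To resume mid-run (e.g. picking up at epoch 50 of 100), it helps to checkpoint more than just the weights: saving the optimizer state and the epoch counter lets the next session continue exactly where the last one stopped. Here is a minimal sketch, assuming PyTorch; the tiny `nn.Linear` model and the `checkpoint.pt` path are placeholders (in Colab you would point the path at your mounted Google Drive so it survives the session):

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer; substitute your GAN's networks and optimizers.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

checkpoint_path = "checkpoint.pt"  # in Colab, use a path on your mounted Gdrive

def save_checkpoint(epoch):
    # Save everything needed to resume: weights, optimizer state, epoch counter.
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, checkpoint_path)

def load_checkpoint():
    # Restore weights and optimizer state; return the epoch to resume from.
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1

save_checkpoint(epoch=50)           # session 1 got cut off after epoch 50
start_epoch = load_checkpoint()     # session 2 resumes here
print(start_epoch)                  # → 51
```

In the next session you would then run your training loop as `for epoch in range(start_epoch, 100)`. For a GAN, save both the generator's and the discriminator's state dicts (and their optimizers) in the same checkpoint.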
If you are not using any framework, it is still possible. You can save your model's weights as NumPy arrays, which you can then save to your Gdrive in several ways. When needed again, instantiate your model and, instead of randomly initializing the parameters, set them to the loaded NumPy arrays.
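The framework-free approach could look like the sketch below, using `np.savez` to bundle all parameter arrays into one file. The parameter names and shapes here are hypothetical; in Colab you would write the file to your mounted Google Drive so it persists:

```python
import numpy as np

# Hypothetical two-layer model parameters; your actual names and shapes will differ.
rng = np.random.default_rng(0)
params = {
    "W1": rng.standard_normal((4, 8)),
    "b1": np.zeros(8),
    "W2": rng.standard_normal((8, 2)),
    "b2": np.zeros(2),
}

# Save all arrays to a single .npz file (one named array per parameter).
np.savez("weights.npz", **params)

# Later session: load the file and use these arrays as the initial parameters
# instead of random initialization.
loaded = np.load("weights.npz")
restored = {name: loaded[name] for name in loaded.files}
```

After loading, assign `restored["W1"]`, `restored["b1"]`, etc. back into your model before continuing training.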