I wrote a program which contains an algorithm called distributed randomized gradient descent (DRGD). There are some internal variables in the algorithm which are used to calculate the step lengths. The training algorithms should be much complex than DRGD, so there should be more internal variables. If we preserve these variables, we can pause training and test the model; then, we will resume the training again.
It is possible to save the states of the trainer and resume training by calling the .save_states()
and .load_states()
functions on the Trainer
class during a training with MXNet Gluon.
Here is an example:
trainer = gluon.Trainer(net.collect_params(), 'adam')
trainer.save_states('training.states')
trainer.load_states('training.states')