This is not, in fact, a duplicate of this question: CNTK python api - continue training a model. The two are related, but they are not the same.
I trained a model for 1500 epochs and was getting an average loss of around 67%. I then wanted to continue training, which I coded as follows:
def Create_Trainer(train_reader, minibatch_size, epoch_size, checkpoint_path=None, distributed_after=INFINITE_SAMPLES):
    # Create model with params
    lr_per_minibatch = learning_rate_schedule(
        [0.01] * 10 + [0.003] * 10 + [0.001], UnitType.minibatch, epoch_size)
    momentum_time_constant = momentum_as_time_constant_schedule(
        -minibatch_size / np.log(0.9))
    l2_reg_weight = 0.0001
    input_var = input_variable((num_channels, image_height, image_width))
    label_var = input_variable((num_classes))
    feature_scale = 1.0 / 256.0
    input_var_norm = element_times(feature_scale, input_var)
    z = create_model(input_var_norm, num_classes)
    # Create error functions
    if(checkpoint_path):
        print('Loaded Checkpoint!')
        z.load_model(checkpoint_path)
    ce = cross_entropy_with_softmax(z, label_var)
    pe = classification_error(z, label_var)
    # Create learner
    learner = momentum_sgd(z.parameters,
                           lr=lr_per_minibatch, momentum=momentum_time_constant,
                           l2_regularization_weight=l2_reg_weight)
    if(distributed_after != INFINITE_SAMPLES):
        learner = distributed.data_parallel_distributed_learner(
            learner=learner,
            num_quantization_bits=1,
            distributed_after=distributed_after
        )
    input_map = {
        input_var: train_reader.streams.features,
        label_var: train_reader.streams.labels
    }
    return Trainer(z, ce, pe, learner), input_map
Notice the line if(checkpoint_path): about halfway down. There I load the .dnn file from the previous training run, which was saved with this code:
if current_epoch % checkpoint_frequency == 0:
    trainer.save_checkpoint(os.path.join(checkpoint_path + "_{}.dnn".format(current_epoch)))
This produces both a .dnn and a .dnn.ckp file; naturally, I only pass the .dnn file to load_model.
When I restart training and the model is loaded, it appears to load the network architecture, but perhaps not the weights. What is the correct methodology for doing this?
THANKS!
You need to use trainer.restore_from_checkpoint instead; this recreates the trainer and the learners from the saved state.
Soon there will be a training session that allows seamless restore in an easy manner, taking care of trainer/minibatch/distributed state.
One important thing: in your Python script, the network structure and the order in which you create the nodes must be the same when you create a checkpoint and when you restore from it.
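For reference, a minimal sketch of what that could look like, assuming the Create_Trainer function from the question is called without checkpoint_path (so z.load_model is skipped), and using a hypothetical checkpoint file name; restore_from_checkpoint reads the .dnn/.dnn.ckp pair written by save_checkpoint and restores the model weights together with the trainer/learner state:

# Rebuild the network and the trainer exactly as in the original run
# (same create_model call, same node creation order).
trainer, input_map = Create_Trainer(train_reader, minibatch_size, epoch_size)

# Restore weights plus trainer/learner state from the checkpoint pair;
# "my_checkpoint_1500.dnn" is a hypothetical file name, and the matching
# .dnn.ckp file must sit next to it.
trainer.restore_from_checkpoint("my_checkpoint_1500.dnn")

# Continue training as usual.
data = train_reader.next_minibatch(minibatch_size, input_map=input_map)
trainer.train_minibatch(data)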