I am training a CNN. Towards the end of the total epochs I reached an L2 loss of about 0.001 and saved a checkpoint. Now, when I resume training from that checkpoint, the error I start with is greater than 0.008.
Here is how I am saving checkpoints:
paths.mkdir('checkpointsR3')
parametersR, gradParametersR = nil, nil -- nil them to avoid spiking memory
if epoch % 50 == 0 then
    util.save('checkpointsR3/' .. opt.name .. '_' .. (epoch+1000) .. '_net_R.t7', netR, opt.gpu)
end
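(In the dcgan.torch training loop the flattened parameter views are usually recreated right after the save, so that the optimizer keeps a valid view of the network's storage. A minimal sketch of that step, assuming the standard getParameters pattern is used elsewhere in my loop:

-- recreate the flattened parameter/gradient tensors after saving,
-- so optim keeps updating the actual network storage
parametersR, gradParametersR = netR:getParameters()
)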
Here is how I am loading a checkpoint:
-- load Residual Learner
assert(opt.net ~= '', 'provide a generator model')
netR = util.load(opt.net, opt.gpu)
netR:evaluate()
The util is a Lua file used directly from Soumith Chintala's dcgan.torch.
I wish to know where I am going wrong, and why the L2 loss is higher than it was when that checkpoint was saved. I have verified that I am loading the most recently trained checkpoint, but I still get the higher error.
Got it. The fault was in this line:
netR:evaluate()
The Torch nn documentation states that training() should be used instead of evaluate() when you want to resume training, because BatchNormalization layers behave differently in training and evaluation mode (per-batch statistics versus accumulated running statistics).
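For completeness, a minimal sketch of the corrected resume path, assuming the same util.load helper and opt table as above:

-- load Residual Learner for further training
assert(opt.net ~= '', 'provide a generator model')
netR = util.load(opt.net, opt.gpu)
netR:training() -- put BatchNormalization (and Dropout) back into training mode

evaluate() should only be switched on for validation or inference passes, where BatchNormalization uses its running statistics instead of per-batch statistics.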