lua, neural-network, conv-neural-network, torch, checkpoint

Resuming Training of CNN from checkpoint in torch7


I am training a CNN. I reached an L2 loss of about 0.001 towards the end of the total epochs and saved a checkpoint. Now, when I resume training from that checkpoint, the error I start with is greater than 0.008.

Here is how I am saving checkpoints:

paths.mkdir('checkpointsR3')
parametersR, gradParametersR = nil, nil -- nil them to avoid spiking memory
if epoch % 50 == 0 then -- save a checkpoint every 50 epochs
     util.save('checkpointsR3/' .. opt.name .. '_' .. (epoch+1000) .. '_net_R.t7', netR, opt.gpu)
end

Here is how I am loading a checkpoint:

-- load Residual Learner
assert(opt.net ~= '', 'provide a generator model')
netR = util.load(opt.net, opt.gpu)
netR:evaluate()

The util is a Lua file used directly from Soumith Chintala's dcgan.torch.

I wish to know where I am going wrong and why the L2 loss is higher than it was when I saved that checkpoint. I have verified that I am loading the most recently trained checkpoint, but I still get a higher error.


Solution

  • Got it. The fault was in:

    netR:evaluate()
    

    The Torch nn documentation states that if one wants to resume training, training() should be used instead of evaluate(), because BatchNormalization layers behave differently in training and evaluation mode; calling evaluate() makes them use their running statistics instead of per-batch statistics, which changes the loss when training continues.
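
    A minimal sketch of the corrected loading code, assuming the same netR, util, and opt names as in the question:

    -- load Residual Learner and put it back into training mode
    assert(opt.net ~= '', 'provide a generator model')
    netR = util.load(opt.net, opt.gpu)
    netR:training() -- not netR:evaluate(); keeps BatchNormalization (and Dropout) in training mode

    With this change, resumed training starts from the same loss the checkpoint was saved at, rather than from the inflated evaluation-mode error.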