I'd like to accumulate gradients across several batches. Training with iter_size = 2 and batch_size = 16 should give the same result as iter_size = 1 and batch_size = 32. I suspect I've missed something in my code, because gradParams are not the same in the two cases. I would really appreciate any help finding the problem. Here is my code:
local params, gradParams = net:getParameters()
local iter_size = 2
local batch_size = 16
local iter = 0

net:zeroGradParameters()

for i, input, target in trainset:sampleiter(batch_size) do
    iter = iter + 1

    -- forward
    local input = input:cuda()
    local target = target:cuda()
    local output = net:forward(input)
    local loss = criterion:forward(output, target)

    -- backward (gradients accumulate in gradParams, since zeroGradParameters
    -- is only called once every iter_size iterations)
    local gradOutput = criterion:backward(output, target)
    local gradInput = net:backward(input, gradOutput)

    -- update once every iter_size iterations
    if iter == iter_size then
        gradParams:mul(1.0 / iter_size)  -- average the accumulated gradients
        net:updateGradParameters(0.9)
        net:updateParameters(0.01)
        iter = 0
        net:zeroGradParameters()
    end
end
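For reference, this is roughly how I compared the two settings. gradsA and gradsB are just placeholder names for gradParams:clone() snapshots taken right before net:updateParameters in each run; the comparison also assumes the criterion averages the loss over the batch (e.g. sizeAverage = true), so that dividing by iter_size puts both runs on the same scale:

-- gradsA: snapshot of gradParams with iter_size = 2, batch_size = 16
-- gradsB: snapshot of gradParams with iter_size = 1, batch_size = 32
-- (both taken just before net:updateParameters in their respective runs)
local diff = (gradsA - gradsB):abs():max()
print(string.format('max abs gradient difference: %e', diff))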
It is also worth mentioning that I manually set the random seed for determinism when comparing results, so the difference is not due to random initialization of the network.
The problem was due to sampling: sampleiter returned images in a different order for different batch sizes, so the batches in the two cases contained different images, and therefore the accumulated gradients differed.
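One way to remove this sampling difference (a sketch, not necessarily what you should run as-is; trainset:size() and trainset:get(indices) are placeholders for whatever your dataset actually exposes) is to shuffle the indices once with a fixed seed and iterate over that fixed order in both runs:

torch.manualSeed(1234)                           -- fix the shuffle itself
local order = torch.randperm(trainset:size())    -- one fixed ordering, reused for both runs

for start = 1, order:size(1) - batch_size + 1, batch_size do
    local indices = order:narrow(1, start, batch_size):long()
    local input, target = trainset:get(indices)  -- placeholder batch fetch by index
    -- ... same forward / backward / accumulation loop body as above ...
end

With the image order fixed, both settings process exactly the same samples in the same order, so the accumulated gradients can be compared directly.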