On GitHub, https://github.com/torch/tutorials/blob/master/2_supervised/4_train.lua gives an example of a script defining a training procedure. I'm interested in the construction of the feval function in this script.
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
   -- get new parameters
   if x ~= parameters then
      parameters:copy(x)
   end

   -- reset gradients
   gradParameters:zero()

   -- f is the average of all criterions
   local f = 0

   -- evaluate function for complete mini batch
   for i = 1,#inputs do
      -- estimate f
      local output = model:forward(inputs[i])
      local err = criterion:forward(output, targets[i])
      f = f + err

      -- estimate df/dW
      local df_do = criterion:backward(output, targets[i])
      model:backward(inputs[i], df_do)

      -- update confusion
      confusion:add(output, targets[i])
   end

   -- normalize gradients and f(X)
   gradParameters:div(#inputs)
   f = f/#inputs

   -- return f and df/dX
   return f,gradParameters
end
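For context, this closure is not called directly in the tutorial; it is handed to an optimizer from the optim package, which calls it to get the current loss and gradient. A minimal sketch of that call (the learning rate value here is just an illustrative choice, not the tutorial's setting):

-- sketch: optim.sgd calls feval(parameters) and uses the returned
-- loss and gradient to take one descent step
require 'optim'
local optimState = {learningRate = 1e-2}
optim.sgd(feval, parameters, optimState)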
I tried to modify this function by removing the loop for i = 1,#inputs do ... end. So instead of doing the forward and backward pass input by input (inputs[i]), I do it for the whole mini-batch (inputs) at once. This really speeds up the process. This is the modified script:
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
   -- get new parameters
   if x ~= parameters then
      parameters:copy(x)
   end

   -- reset gradients
   gradParameters:zero()

   -- f is the average of all criterions
   local f = 0

   -- evaluate function for complete mini batch
   -- estimate f
   local output = model:forward(inputs)
   local f = criterion:forward(output, targets)

   -- estimate df/dW
   local df_do = criterion:backward(output, targets)

   -- update weight
   model:backward(inputs, df_do)

   -- update confusion
   confusion:batchAdd(output, targets)

   -- return f and df/dX
   return f,gradParameters
end
But when I check the return values of feval (f, gradParameters) in detail for a given mini-batch, I don't get the same result with the loop and without it.
So my questions are: 1 - Why do we have this loop? 2 - Is it possible to get the same result without it?
Regards, Sam
NB: I'm a beginner in Torch7.
I'm sure you noticed that getting the second way to work requires a bit more than simply changing feval. In your second example, inputs needs to be a 4D tensor rather than a table of 3D tensors (unless something has changed since I last updated), and these tensors have different sizes depending on the loss criterion/model used. Whoever implemented the example must have thought the loop was the easier way to go here. In addition, ClassNLLCriterion does not seem to like batch processing (one would usually use CrossEntropyCriterion to get around this).
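As a rough sketch of what that batching could look like, here is one way to stack a table of 3D tensors into a single 4D tensor and switch the criterion. This assumes 3x32x32 inputs (e.g. CIFAR-style images) and a model whose layers accept a leading batch dimension; the names batchInputs and batchTargets are mine, not from the tutorial:

-- sketch only: stack a table of 3D tensors into one 4D batch tensor
require 'torch'
require 'nn'

local batchSize = #inputs
local batchInputs = torch.Tensor(batchSize, 3, 32, 32)  -- batch x channels x height x width
local batchTargets = torch.Tensor(batchSize)
for i = 1, batchSize do
   batchInputs[i]:copy(inputs[i])   -- copy each 3D tensor into slice i
   batchTargets[i] = targets[i]
end

-- CrossEntropyCriterion (LogSoftMax + ClassNLLCriterion combined)
-- handles batched 2D model output:
local criterion = nn.CrossEntropyCriterion()

You would then call model:forward(batchInputs) and criterion:forward(output, batchTargets) as in your second feval.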
All of this aside, though, the two methods should give the same result. The only slight difference is that the first example uses the average error/gradient while the second uses the sum: your batched version drops the normalization step, which (written for a 4D inputs tensor) would read:
gradParameters:div(inputs:size(1))
f = f/inputs:size(1)
In the second case, f and gradParameters should therefore differ from the first only by a factor of opt.batchSize. The two are mathematically equivalent for optimization purposes: scaling both the loss and its gradient by a constant does not change the minimizer, and for plain SGD it amounts to rescaling the learning rate.
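Concretely, if you want the batched feval to return exactly what the looped version returns, you can restore the normalization at the end. This sketch assumes the criterion returns the summed error over the batch (check your criterion's sizeAverage flag, which changes this behaviour), and reuses the 4D-tensor names from above as an assumption:

-- sketch: tail of the batched feval with normalization restored,
-- so it matches the averaged result of the looped version
local output = model:forward(batchInputs)
local f = criterion:forward(output, batchTargets)
local df_do = criterion:backward(output, batchTargets)
model:backward(batchInputs, df_do)
confusion:batchAdd(output, batchTargets)

-- normalize gradients and f(X), as the looped version does
gradParameters:div(batchInputs:size(1))
f = f / batchInputs:size(1)

return f, gradParameters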