Here is a toy model. I print the model parameters before calling backward exactly once, then print the model parameters again. The parameters are unchanged. If I add the line model:updateParameters(<learning_rate>) after calling backward, I see the parameters update.
But in the example code I've seen, for example https://github.com/torch/demos/blob/master/train-a-digit-classifier/train-on-mnist.lua, no one actually calls updateParameters. Also, it doesn't look like optim.sgd, optim.adam, or nn.StochasticGradient ever call updateParameters either. What am I missing here? How do the parameters get updated automatically? If I must call updateParameters, why do no examples do that?
require 'nn'
require 'optim'
local model = nn.Sequential()
model:add(nn.Linear(4, 1, false))
local params, grads = model:getParameters()
local criterion = nn.MSECriterion()
local inputs = torch.randn(1, 4)
local labels = torch.Tensor{1}
print(params)
model:zeroGradParameters()
local output = model:forward(inputs)
local loss = criterion:forward(output, labels)
local dfdw = criterion:backward(output, labels)
model:backward(inputs, dfdw)
-- With the line below uncommented, the parameters are updated:
-- model:updateParameters(1000)
print(params)
backward() is not supposed to change the parameters; it merely computes the derivatives of the error function with respect to all of the parameters of the network.
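To make that concrete, here is a minimal sketch (assuming Torch7 with the nn package installed; the seed and the variable names are mine, not from the question) that checks what a single backward() call touches. I would expect it to report that the parameters are unchanged while the gradient tensor has been filled in:

```lua
require 'nn'

torch.manualSeed(1) -- fixed seed, only so repeated runs look the same

local model = nn.Sequential()
model:add(nn.Linear(4, 1))
local params, grads = model:getParameters()
local criterion = nn.MSECriterion()

local input = torch.randn(1, 4)
local target = torch.Tensor{{1}}

local before = params:clone() -- snapshot of the parameters

model:zeroGradParameters()
local output = model:forward(input)
local loss = criterion:forward(output, target)
model:backward(input, criterion:backward(output, target))

-- backward() only accumulates into the gradient tensors
local paramsChanged = torch.abs(params - before):max() > 0
local gradsFilled = torch.abs(grads):max() > 0
print('parameters changed:', paramsChanged)
print('gradients filled in:', gradsFilled)
```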
In general, training is a sequence of the following steps:
repeat
    local output = model:forward(input)                  -- see what the model predicts
    local loss = criterion:forward(output, answer)       -- see how wrong it is
    local loss_grad = criterion:backward(output, answer) -- see where it is the most wrong
    model:backward(input, loss_grad)                     -- see how much each parameter of the network is responsible for the error
    model:updateParameters(learningRate)                 -- fix the parameters based on their wrongness
    model:zeroGradParameters()                           -- the parameters are different now, so the old gradients are of no use
until is_user_satisfied()
updateParameters implements the simplest optimization algorithm here (gradient descent).
If so inclined, you may use your own function instead. In theory, you might perform explicit loops through the network storages to update their values.
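As a rough illustration of such a hand-written update, the sketch below (Torch7 with nn assumed; learningRate and the other names are placeholders of mine) loops over the tensors returned by model:parameters() and applies a plain gradient descent step to each one. For comparison it also runs updateParameters on a clone of the model; as far as I can tell from the nn source, the two should end up with the same values:

```lua
require 'nn'

local model = nn.Sequential()
model:add(nn.Linear(4, 1))
local criterion = nn.MSECriterion()
local input = torch.randn(1, 4)
local target = torch.Tensor{{1}}
local learningRate = 0.1 -- arbitrary value for the illustration

-- one forward/backward pass to fill in the gradients
model:zeroGradParameters()
local output = model:forward(input)
criterion:forward(output, target)
model:backward(input, criterion:backward(output, target))

-- reference copy, updated with the built-in method
local reference = model:clone()
reference:updateParameters(learningRate)

-- hand-written step: w <- w - learningRate * dE/dw for every tensor
local weights, gradWeights = model:parameters()
for i = 1, #weights do
    weights[i]:add(-learningRate, gradWeights[i])
end

-- largest difference between the two updates
local refWeights = reference:parameters()
local maxDiff = 0
for i = 1, #weights do
    maxDiff = math.max(maxDiff, torch.abs(weights[i] - refWeights[i]):max())
end
print('max difference:', maxDiff)
```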
In practice, you usually call getParameters():
local model_parameters,model_parameters_gradient=model:getParameters()
This yields flat tensors holding all of the values and all of the gradients. These tensors are views into the network's storage, so changes to them affect the network. You may not know which position in the tensor corresponds to which parameter in the network, but most optimizers do not care about that.
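A small sketch of the "views" point, assuming Torch7 with nn (the layer sizes are only an example): writing into the flat tensor should show up in the layer's own weight and bias tensors, because they share storage.

```lua
require 'nn'

local model = nn.Sequential()
model:add(nn.Linear(4, 1))

local params, gradParams = model:getParameters()
local count = params:nElement() -- 4 weights + 1 bias
print('flat size:', count)

-- write through the flat view
params:fill(0.5)

-- the layer sees the new values without any copying
local linear = model:get(1)
local weightIsHalf = linear.weight:min() == 0.5 and linear.weight:max() == 0.5
local biasIsHalf = linear.bias[1] == 0.5
print('weight updated:', weightIsHalf)
print('bias updated:', biasIsHalf)
```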
The demo's usage of optim.sgd is as follows:
optim.sgd(
function_to_return_error_and_its_gradients,
model_parameters,
optimizer_special_settings)
The specifics are covered in the demo, but the relevant point here is that the optimizer receives model_parameters as an argument, which gives it write access to the network. This is not stated explicitly in the documentation, but the source code shows that the optimizer modifies its input tensor in place (note also that it returns the same tensor it received).
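Putting it together, here is a minimal sketch of one optim.sgd step on a toy model (Torch7 with nn and optim assumed; feval, the learning rate and the data are illustrative choices of mine, not taken from the demo). The closure returns the loss and the flat gradient tensor, and optim.sgd writes the update into params, which is why no explicit updateParameters call is needed:

```lua
require 'nn'
require 'optim'

local model = nn.Sequential()
model:add(nn.Linear(4, 1))
local criterion = nn.MSECriterion()
local params, gradParams = model:getParameters()

local input = torch.randn(1, 4)
local target = torch.Tensor{{1}}

-- closure expected by optim: returns the loss and its gradient at x
local function feval(x)
    if x ~= params then params:copy(x) end
    gradParams:zero()
    local output = model:forward(input)
    local loss = criterion:forward(output, target)
    model:backward(input, criterion:backward(output, target))
    return loss, gradParams
end

local before = params:clone()
local returned, losses = optim.sgd(feval, params, {learningRate = 0.1})

-- the network's parameters were modified in place by the optimizer
local changed = torch.abs(params - before):max() > 0
local sameTensor = torch.pointer(returned) == torch.pointer(params)
print('parameters changed:', changed)
print('returned the input tensor:', sameTensor)
print('loss before the step:', losses[1])
```

In a real training loop you would call optim.sgd once per minibatch, reusing the same configuration table between calls.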