I know these two functions are used in Torch's backward propagation and the interface is as follows:
updateGradInput(input, gradOutput)
accGradParameters(input, gradOutput, scale)
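For context, nn.Module's backward seems to chain them roughly like this (a simplified view of torch/nn's Module.lua):

function Module:backward(input, gradOutput, scale)
   scale = scale or 1
   self:updateGradInput(input, gradOutput)
   self:accGradParameters(input, gradOutput, scale)
   return self.gradInput
end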
I'm confused about what gradInput and gradOutput really mean in a layer. Assume the network's cost is C and consider a layer L. Do gradInput and gradOutput of layer L mean d_C/d_input_L and d_C/d_output_L? If so, how do I compute gradInput from gradOutput?

Moreover, does accGradParameters mean accumulating d_C/d_Weight_L and d_C/d_bias_L? If so, how do I compute these values?
Do gradInput and gradOutput of layer L mean d_C/d_input_L and d_C/d_output_L?

Yes: gradInput = derivative of the cost w.r.t. the layer's input, gradOutput = derivative of the cost w.r.t. the layer's output.

How do I compute gradInput from gradOutput?
Adapting the schema from The building blocks of Deep Learning (warning: in this schema the cost is denoted L = Loss, and the layer f), the chain rule gives:

d_C/d_input_L = (d_output_L/d_input_L)^T * d_C/d_output_L

i.e. gradInput is obtained by multiplying gradOutput by the transposed Jacobian of the layer's output with respect to its input.
For a concrete, step-by-step example of such a computation on a LogSoftMax layer you can refer to this answer.
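For example, for a Linear layer, where output = W * input + b, that Jacobian is just W, so a stripped-down updateGradInput (a sketch assuming a 1-D input; the real torch/nn Linear.lua also handles 2-D mini-batches, and the function name below is just for illustration) could look like:

-- Simplified sketch of nn.Linear's updateGradInput for a 1-D input.
local function linearUpdateGradInput(self, input, gradOutput)
   -- d_C/d_input_L = W^T * d_C/d_output_L
   self.gradInput:resizeAs(input)
   self.gradInput:addmv(0, 1, self.weight:t(), gradOutput)
   return self.gradInput
end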
Does accGradParameters mean accumulating d_C/d_Weight_L and d_C/d_bias_L?

Yes. They are named gradWeight and gradBias in torch/nn.
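Because they accumulate across calls, you normally reset them before each backward pass. As a rough, hypothetical example of where this fits in a training step (the model, criterion and data below are made up for illustration):

require 'nn'

local model = nn.Linear(10, 2)
local criterion = nn.MSECriterion()
local x, y = torch.randn(10), torch.randn(2)

model:zeroGradParameters()                        -- reset gradWeight / gradBias (they accumulate)
local output = model:forward(x)
local cost = criterion:forward(output, y)         -- C
local gradOutput = criterion:backward(output, y)  -- d_C/d_output_L
model:backward(x, gradOutput)                     -- calls updateGradInput and accGradParameters
model:updateParameters(0.1)                       -- e.g. W = W - 0.1 * gradWeight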
How do I compute these values?

Similarly as above, still using the chain rule from the blog post:

d_C/d_Weight_L = d_C/d_output_L * d_output_L/d_Weight_L

except that the Jacobian does not have the same dimensionality (see the blog post for more details). As an example, for a Linear layer (output = W * input + b) this translates into:

d_C/d_Weight_L = gradOutput * input^T

which is the outer product between the layer's input and gradOutput. In Torch we have:
self.gradWeight:addr(scale, gradOutput, input)
And:

d_C/d_bias_L = gradOutput

i.e. the bias gradient is simply gradOutput. In Torch we have:
self.gradBias:add(scale, gradOutput)
In both cases scale is a scale factor, used in practice as the learning rate.
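Putting the two lines together, a stripped-down accGradParameters for a Linear layer (again a sketch assuming a 1-D input; the function name is just for illustration) would look like:

-- Simplified sketch of nn.Linear's accGradParameters for a 1-D input.
local function linearAccGradParameters(self, input, gradOutput, scale)
   scale = scale or 1
   -- d_C/d_Weight_L accumulated as scale * (gradOutput outer-product input)
   self.gradWeight:addr(scale, gradOutput, input)
   -- d_C/d_bias_L accumulated as scale * gradOutput
   self.gradBias:add(scale, gradOutput)
end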