I know these two functions are used in Torch's backward propagation and the interface is as follows:
updateGradInput(input, gradOutput)
accGradParameters(input, gradOutput, scale)
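For context, nn.Module's backward seems to chain them roughly like this (a simplified view of torch/nn's Module.lua):

function Module:backward(input, gradOutput, scale)
   scale = scale or 1
   self:updateGradInput(input, gradOutput)
   self:accGradParameters(input, gradOutput, scale)
   return self.gradInput
end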
I'm confused about what gradInput and gradOutput really mean in a layer. Assume the network's cost is C and consider a layer L. Do gradInput and gradOutput of layer L mean d_C/d_input_L and d_C/d_output_L? If so, how do I compute gradInput from gradOutput?

Moreover, does accGradParameters mean accumulating d_C/d_Weight_L and d_C/d_bias_L? If so, how do I compute these values?
Do gradInput and gradOutput of layer L mean d_C/d_input_L and d_C/d_output_L?

Yes: gradInput = derivative of the cost w.r.t. the layer's input, gradOutput = derivative of the cost w.r.t. the layer's output.

How do I compute gradInput from gradOutput?
Adapting the schema from The building blocks of Deep Learning (warning: in this schema the cost is denoted L = Loss, and the layer f), the chain rule gives:

d_C/d_input_L = (d_output_L/d_input_L)^T * d_C/d_output_L

i.e. gradInput is obtained by multiplying gradOutput by the transposed Jacobian of the layer's output with respect to its input.
For a concrete, step-by-step example of such a computation on a LogSoftMax layer you can refer to this answer.
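For example, for a Linear layer, where output = W * input + b, that Jacobian is just W, so a stripped-down updateGradInput (a sketch assuming a 1-D input; the real torch/nn Linear.lua also handles 2-D mini-batches, and the function name below is just for illustration) could look like:

-- Simplified sketch of nn.Linear's updateGradInput for a 1-D input.
local function linearUpdateGradInput(self, input, gradOutput)
   -- d_C/d_input_L = W^T * d_C/d_output_L
   self.gradInput:resizeAs(input)
   self.gradInput:addmv(0, 1, self.weight:t(), gradOutput)
   return self.gradInput
end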
Does accGradParameters mean accumulating d_C/d_Weight_L and d_C/d_bias_L?

Yes. They are named gradWeight and gradBias in torch/nn.
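Because they accumulate across calls, you normally reset them before each backward pass. As a rough, hypothetical example of where this fits in a training step (the model, criterion and data below are made up for illustration):

require 'nn'

local model = nn.Linear(10, 2)
local criterion = nn.MSECriterion()
local x, y = torch.randn(10), torch.randn(2)

model:zeroGradParameters()                        -- reset gradWeight / gradBias (they accumulate)
local output = model:forward(x)
local cost = criterion:forward(output, y)         -- C
local gradOutput = criterion:backward(output, y)  -- d_C/d_output_L
model:backward(x, gradOutput)                     -- calls updateGradInput and accGradParameters
model:updateParameters(0.1)                       -- e.g. W = W - 0.1 * gradWeight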
How do I compute these values?

Similarly as above, still using the chain rule from the blog post:

d_C/d_Weight_L = d_C/d_output_L * d_output_L/d_Weight_L

except that the Jacobian does not have the same dimensionality (see the blog post for more details). As an example, for a Linear layer (output = W * input + b) this translates into:

d_C/d_Weight_L = gradOutput * input^T

which is the outer product between the layer's input and gradOutput. In Torch we have:
self.gradWeight:addr(scale, gradOutput, input)
And:

d_C/d_bias_L = gradOutput

i.e. the bias gradient is simply gradOutput. In Torch we have:
self.gradBias:add(scale, gradOutput)
In both cases scale is a scale factor, used in practice as the learning rate.
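Putting the two lines together, a stripped-down accGradParameters for a Linear layer (again a sketch assuming a 1-D input; the function name is just for illustration) would look like:

-- Simplified sketch of nn.Linear's accGradParameters for a 1-D input.
local function linearAccGradParameters(self, input, gradOutput, scale)
   scale = scale or 1
   -- d_C/d_Weight_L accumulated as scale * (gradOutput outer-product input)
   self.gradWeight:addr(scale, gradOutput, input)
   -- d_C/d_bias_L accumulated as scale * gradOutput
   self.gradBias:add(scale, gradOutput)
end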