Question summary: How is the dimensionality of inputs and outputs handled in the backward pass of custom functions?
According to the manual, the basic structure of custom functions is the following:
import torch

class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):  # f(x) = e^x
        result = input.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):  # df(x) = e^x
        result, = ctx.saved_tensors
        return grad_output * result
For a single input and output dimension, this is perfectly fine and works like a charm. But for higher dimensions the backward pass becomes confusing. Apparently, PyTorch only accepts a result of backward that has the same dimensionality as the result of forward (for the same input). Returning a wrong shape yields a RuntimeError: Function MyFunc returned an invalid gradient at index 0 - got [*] but expected shape compatible with [*]. So I am wondering: What does backward actually compute?
It's not a Jacobian? For example, when I have a function f(x) = ( f_1(x_1, ... , x_n), ... , f_k(x_1, ... , x_n) ) with n inputs and k outputs, I would expect that a gradient calculation would yield a Jacobian matrix of dimension k*n. However, the PyTorch implementation expects just a vector of dimension n. So what does the backward result actually mean, if it can't be the Jacobian?
And it does not handle batches? Moreover, what if I would like to push a batch of input vectors through this function, e.g. an input of dimension b*n with batch size b? Then, instead of something like b*k*n, the gradient is expected to also have the shape b*n. Is it even intended to consider the processing of batches with custom functions?
None of these questions seems to be addressed in the manual and the provided examples are very simple, which does not help at all. Maybe there are formulas hidden somewhere that explain the background of the provided Function interface in more detail, but I haven't found them yet.
It does not store/return the Jacobian (I imagine this is related to memory considerations).
From a training perspective, we do not need the Jacobian for updating parameters/back-propagating further.
For updating parameters, all we need is dL/dy_j, j<n:
y_j -= alpha * dL/dy_j
And for backpropagating further, say through z=f(y)=f(g(x)), the chain rule gives (summing over k):
dL/dy_j = sum_k dL/dz_k * dz_k/dy_j
One may say "but we need dz_k/dy_j here!" It is true, but we do not need to store it: only its product with dL/dz_k is ever formed (and the Jacobian dy_j/dx_i is not used at all in this step).
In other words, the Jacobian is only used implicitly, is not required for the most part, and is therefore done away with to save memory.
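To make the "implicit Jacobian" point concrete, here is a minimal sketch, assuming a made-up function f(x) = sin(W x) with n inputs and k outputs. The names SinLinear, W and v are hypothetical and only serve the illustration. backward returns a vector of size n (the incoming gradient contracted with the Jacobian), and the full k*n Jacobian is built only afterwards for comparison.

```python
import torch

# Hypothetical example function: f(x) = sin(W x), mapping n inputs to k outputs.
class SinLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, W):
        pre = W @ x                       # shape (k,)
        ctx.save_for_backward(pre, W)
        return pre.sin()                  # shape (k,)

    @staticmethod
    def backward(ctx, grad_output):       # grad_output = dL/dy, shape (k,)
        pre, W = ctx.saved_tensors
        # Vector-Jacobian product: sum_j dL/dy_j * dy_j/dx_i, shape (n,).
        # The (k, n) Jacobian itself is never materialized.
        grad_x = (grad_output * pre.cos()) @ W
        return grad_x, None               # no gradient for W in this sketch


torch.manual_seed(0)
n, k = 4, 3
W = torch.randn(k, n, dtype=torch.double)
x = torch.randn(n, dtype=torch.double, requires_grad=True)
v = torch.randn(k, dtype=torch.double)    # stands in for dL/dy

y = SinLinear.apply(x, W)
y.backward(v)                             # fills x.grad with a vector of size n

# For comparison only: build the full (k, n) Jacobian and contract it with v.
J = torch.autograd.functional.jacobian(lambda t: (W @ t).sin(), x.detach())
print(x.grad.shape, J.shape)
print(torch.allclose(x.grad, v @ J))
```

The value returned by backward has the shape of the input x, not of the Jacobian, which is consistent with the shape check behind the RuntimeError quoted in the question.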
And for the batch part, note that mini-batch learning mostly just averages the gradient over the batch. PyTorch expects you to handle the batch dimension yourself in the backward function (again, so that the function returns as little as possible and saves as much memory as possible).
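Note: One can "gather" the Jacobian to obtain the n-sized vector that you have mentioned. Specifically, weight each of the k rows by the matching entry of grad_output and sum over the k dimension. For a batch this is done per sample, which gives the b*n shape; any averaging over the batch dimension comes from the loss and reaches backward through grad_output.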
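The batch case can be sketched the same way. This is a minimal illustration, assuming a made-up function f(x) = sin(x W^T) applied row by row to an input of shape b*n; SinLinearBatched and W are hypothetical names. In this sketch the returned gradient keeps the batch dimension (shape b*n), and the 1/b factor of a mean-reduced loss arrives through grad_output rather than being applied inside backward.

```python
import torch

# Hypothetical batched function: each row of x (shape (b, n)) maps to k outputs.
class SinLinearBatched(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, W):               # x: (b, n), W: (k, n)
        pre = x @ W.t()                   # (b, k)
        ctx.save_for_backward(pre, W)
        return pre.sin()                  # (b, k)

    @staticmethod
    def backward(ctx, grad_output):       # grad_output = dL/dy, shape (b, k)
        pre, W = ctx.saved_tensors
        # Per-sample vector-Jacobian product; the result has the input's shape (b, n).
        return (grad_output * pre.cos()) @ W, None


torch.manual_seed(0)
b, n, k = 5, 4, 3
W = torch.randn(k, n, dtype=torch.double)
x = torch.randn(b, n, dtype=torch.double, requires_grad=True)

# Mean over the batch: every entry of grad_output equals 1/b here.
loss = SinLinearBatched.apply(x, W).sum(dim=1).mean()
loss.backward()

# Reference gradient from PyTorch's built-in autograd on the same expression.
x_ref = x.detach().clone().requires_grad_(True)
(x_ref @ W.t()).sin().sum(dim=1).mean().backward()

print(x.grad.shape)
print(torch.allclose(x.grad, x_ref.grad))
```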
EDIT: Not 100% sure, but I think the backward call (of f(x)=y) is expected to return this vector:
where \nabla x is the input argument to backward.
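For reference, the vector in question is most likely the vector-Jacobian product. The following is a sketch under assumed notation, which may differ from the formula originally shown here: L is the scalar loss, J is the k*n Jacobian with entries dy_j/dx_i, and \nabla_y L stands for the grad_output argument of backward.

```latex
\nabla_x L = J^\top \, \nabla_y L ,
\qquad
\frac{\partial L}{\partial x_i}
  = \sum_{j=1}^{k} \frac{\partial L}{\partial y_j} \, \frac{\partial y_j}{\partial x_i} ,
\quad i = 1, \dots, n
```

Under this reading the result has n entries, one per input, which matches the shape that PyTorch checks for.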