Tags: python, optimization, pytorch, backpropagation, automatic-differentiation

Why does PyTorch autograd need an extra vector for backward instead of computing the Jacobian?


To run backward in PyTorch, we can pass an optional vector v, as in y.backward(v), to compute the product of the Jacobian matrix with v:

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2  # element-wise, so the Jacobian dy/dx is 2 * I

v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)  # accumulates the vector-Jacobian product into x.grad

print(x.grad)
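In this example y = 2 * x, so the Jacobian is 2 * I and backward(v) fills x.grad with the vector-Jacobian product 2 * v = [0.2, 2.0, 0.0002].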

I would think computing the full Jacobian matrix costs about the same, because every node in the AD graph needed for the Jacobian is evaluated anyway. So why doesn't PyTorch just give us the Jacobian matrix?


Solution

  • When you call backward(), PyTorch updates the grad of each learnable parameter with the gradient of some loss function L w.r.t. that parameter. It was designed with gradient descent [GD] (and its variants) in mind: once the gradient has been computed, you can update each parameter with x = x - learning_rate * x.grad. The same graph traversal that would produce the Jacobian does happen in the background, but the full Jacobian is not what one (generally) needs when applying GD optimization. The vector [0.1, 1.0, 0.0001] reduces the output to a scalar, so that x.grad is a vector (and not a matrix, as it would be without the reduction), and hence the GD update is well defined. You can, however, still obtain the Jacobian by calling backward with one-hot vectors, one per output. For example, in this case:

    x = torch.randn(3, requires_grad=True)
    y = x * 2
    J = torch.zeros(x.shape[0], x.shape[0])
    for i in range(x.shape[0]):
        # one-hot vector e_i selects the i-th output of y
        v = torch.tensor([1.0 if j == i else 0.0 for j in range(x.shape[0])])
        y.backward(v, retain_graph=True)  # keep the graph for the next pass
        J[i, :] = x.grad                  # e_i^T J is the i-th row of the Jacobian
        x.grad.zero_()                    # clear the accumulated gradient
    print(J)
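
  • For completeness, recent PyTorch versions also ship a built-in helper, torch.autograd.functional.jacobian, which does essentially this row-by-row backward work for you. A minimal sketch of the same example:

    import torch
    from torch.autograd.functional import jacobian

    def f(x):
        return x * 2

    x = torch.randn(3)
    J = jacobian(f, x)  # 3x3 matrix; here it equals 2 * torch.eye(3)
    print(J)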