To perform backward in PyTorch, we can pass an optional vector argument, y.backward(v), to compute the Jacobian matrix multiplied by v:
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)    # stores the vector-Jacobian product v^T J in x.grad
print(x.grad)
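Since y = 2 * x, the Jacobian here is just 2·I, so x.grad ends up holding the vector-Jacobian product vᵀJ = 2·v; a quick sanity check, continuing the snippet above:

print(torch.allclose(x.grad, 2 * v))   # True: x.grad is [0.2, 2.0, 0.0002]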
I think computing the full Jacobian matrix would cost about the same, because every node in the AD graph that is needed for the Jacobian is still evaluated. So why doesn't PyTorch just give us the Jacobian matrix?
When you call backward(), PyTorch updates the grad of each learnable parameter with the gradient of some loss function L w.r.t. that parameter. It has been designed with gradient descent (GD) and its variants in mind: once the gradient has been computed, you can update each parameter with x = x - learning_rate * x.grad (a minimal update sketch is given at the end of this answer). Indeed, the backward pass traverses the same graph that the full Jacobian would require, but the matrix itself is not what one (generally) needs when applying GD optimization.

The vector [0.1, 1.0, 0.0001] effectively reduces the output to a scalar, so that x.grad is a vector (and not a matrix, as it would have to be without the reduction), and hence GD is well defined. You could, however, obtain the Jacobian by calling backward with one-hot vectors. For example, in this case:
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
J = torch.zeros(x.shape[0], x.shape[0])
for i in range(x.shape[0]):
    # one-hot vector e_i selects output y_i
    v = torch.tensor([1.0 if j == i else 0.0 for j in range(x.shape[0])], dtype=torch.float)
    y.backward(v, retain_graph=True)   # keep the graph for the next iteration
    J[i, :] = x.grad                   # row i of the Jacobian: dy_i/dx
    x.grad.zero_()                     # clear the accumulated gradient
print(J)
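If all you want is the Jacobian and you are on a reasonably recent PyTorch (1.5+, where torch.autograd.functional was added), you can also let the library do this loop for you; a minimal sketch:

import torch
from torch.autograd.functional import jacobian

def f(x):
    return x * 2

x = torch.randn(3)
J = jacobian(f, x)   # full Jacobian of f at x, here 2 * I
print(J)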
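For completeness, here is the kind of update the line x = x - learning_rate * x.grad refers to, as a minimal manual GD sketch (the loss and learning rate below are made up purely for illustration):

import torch

x = torch.randn(3, requires_grad=True)
learning_rate = 0.01

loss = (x * 2).sum()          # some scalar loss; placeholder for illustration
loss.backward()               # fills x.grad with dloss/dx

with torch.no_grad():         # the update itself must not be tracked by autograd
    x -= learning_rate * x.grad
x.grad.zero_()                # clear the gradient before the next step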