I'm trying to compute the gradient of my loss function with respect to my model parameters in PyTorch.
That is, let u(x; θ) be the model, where x is the input (in R^n) and θ are the model parameters. I'm trying to compute du/dθ.
For a "simple" loss function, this is not a problem, but my loss function depends on the gradient of the model with respect to its inputs (i.e., du/dx
). When I attempt to do this, I'm met with the following error message: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Here is a minimal example to illustrate the issue:
import torch
import torch.nn as nn
from torch.autograd import grad

model = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))

def loss1(x, u):
    # "Simple" loss: depends only on the model output u.
    return torch.mean(u)

def loss2(x, u):
    # Loss that depends on du/dx; create_graph=True so that d_u_x can
    # itself be differentiated with respect to the parameters.
    d_u_x = grad(u, x, torch.ones_like(u), retain_graph=True, create_graph=True)[0]
    return torch.mean(d_u_x)

x = torch.randn(10, 1)
x.requires_grad_()  # needed so u can be differentiated with respect to x
u = model(x)
loss = loss2(x, u)
d_loss_params = grad(loss, model.parameters(), retain_graph=True)  # raises the error above
If I change the second-to-last line to read loss = loss1(x, u), things work as expected.
Update: it appears to work if I set bias=False for the nn.Linears. OK, that makes some sense: the bias of the final nn.Linear never appears in du/dx (for u = W2 tanh(W1 x + b1) + b2, we have du/dx = W2 diag(1 - tanh^2(W1 x + b1)) W1, which involves b1 but not b2), so that parameter is genuinely unused in the graph of loss2. But that raises the question: how do I restrict the gradient computation to only the parameters that are actually used?
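For example, passing allow_unused=True by itself already reveals which parameters are unused, since their entries come back as None (a quick sketch using the model and loss from above):

params = list(model.parameters())
grads = grad(loss, params, retain_graph=True, allow_unused=True)
for (name, _), g in zip(model.named_parameters(), grads):
    # Only the final layer's bias ('2.bias') should come back as None here.
    print(name, "unused" if g is None else "used")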
This was solved by passing allow_unused=True and materialize_grads=True to grad. That is:

d_loss_params = grad(loss, model.parameters(), retain_graph=True, allow_unused=True, materialize_grads=True)
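With materialize_grads=True (available in recent PyTorch versions), the unused entries are returned as zero tensors of the appropriate shape instead of None, so the results line up one-to-one with the parameters. A quick check under the same setup as above:

d_loss_params = grad(loss, model.parameters(), retain_graph=True,
                     allow_unused=True, materialize_grads=True)
for (name, _), g in zip(model.named_parameters(), d_loss_params):
    # The final layer's bias now gets an all-zero gradient instead of None.
    print(name, tuple(g.shape), g.abs().sum().item())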
See the discussion at https://discuss.pytorch.org/t/gradient-of-loss-that-depends-on-gradient-of-network-with-respect-to-parameters/217275 for more details.