I am aware that, while employing loss.backward()
we need to specify retain_graph=True
if there are multiple networks and multiple loss functions to optimize each network separately. But even with (or without) specifying this parameter I am getting errors. Following is an MWE to reproduce the issue (on PyTorch 1.6).
import torch
from torch import nn
from torch import optim
torch.autograd.set_detect_anomaly(True)
class GRU1(nn.Module):
def __init__(self):
super(GRU1, self).__init__()
self.brnn = nn.GRU(input_size=2, bidirectional=True, num_layers=1, hidden_size=100)
def forward(self, x):
return self.brnn(x)
class GRU2(nn.Module):
def __init__(self):
super(GRU2, self).__init__()
self.brnn = nn.GRU(input_size=200, bidirectional=True, num_layers=1, hidden_size=1)
def forward(self, x):
return self.brnn(x)
gru1 = GRU1()
gru2 = GRU2()
gru1_opt = optim.Adam(gru1.parameters())
gru2_opt = optim.Adam(gru2.parameters())
criterion = nn.MSELoss()
for i in range(100):
gru1_opt.zero_grad()
gru2_opt.zero_grad()
vector = torch.randn((15, 100, 2))
gru1_output, _ = gru1(vector) # (15, 100, 200)
loss_gru1 = criterion(gru1_output, torch.randn((15, 100, 200)))
loss_gru1.backward(retain_graph=True)
gru1_opt.step()
gru1_output, _ = gru1(vector) # (15, 100, 200)
gru2_output, _ = gru2(gru1_output) # (15, 100, 2)
loss_gru2 = criterion(gru2_output, torch.randn((15, 100, 2)))
loss_gru2.backward(retain_graph=True)
gru2_opt.step()
print(f"GRU1 loss: {loss_gru1.item()}, GRU2 loss: {loss_gru2.item()}")
With retain_graph
set to True
I get the error
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 300]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
The error without the parameter is
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
which is expected.
Please point at what needs to be changed in the above code for it to begin training. Any help is appreciated.
In such a case, one can detach the computation graph to exclude the parameters that don't need to be optimized. In this case, the computation graph should be detached after the second forward pass with gru1
i.e.
....
gru1_opt.step()
gru1_output, _ = gru1(vector)
gru1_output = gru1_output.detach()
....
This way, you won't "try to backward through the graph a second time" as the error mentioned.