Suppose I have a simple one-hidden-layer network that I'm training in the typical way:
```python
for x, y in trainData:
    optimizer.zero_grad()
    out = self(x)
    loss = self.lossfn(out, y)
    loss.backward()
    optimizer.step()
```
This works as expected, but if I instead pre-allocate and update the output array, I get an error:
```python
out = torch.empty_like(trainData.tensors[1])
for i, (x, y) in enumerate(trainData):
    optimizer.zero_grad()
    out[i] = self(x)
    loss = self.lossfn(out[i], y)
    loss.backward()
    optimizer.step()
```
```
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
```
What's happening in the second version that makes PyTorch try to backward through the graph a second time? Why is this not an issue in the first version? (Note that this error occurs even if I don't call `zero_grad()`.)
The error means the program is trying to backpropagate through the same set of operations a second time. The first time you backpropagate through a computational graph, PyTorch frees that graph to save memory, so a second backward pass through it fails because the graph no longer exists.
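This behavior is easy to reproduce in isolation. A minimal sketch (using an arbitrary tensor, not the question's network):

```python
import torch

# Minimal reproduction: backpropagating twice through the same graph
# fails, because PyTorch frees the graph's buffers after the first
# backward() call.
w = torch.randn(3, requires_grad=True)
loss = (w * 2).sum()

loss.backward()      # first backward succeeds; the graph is then freed
try:
    loss.backward()  # second backward fails: the graph no longer exists
except RuntimeError as err:
    print("RuntimeError:", err)
```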
Use `loss.backward(retain_graph=True)`; this tells PyTorch to keep the computational graph instead of deleting it.
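As a quick sanity check (a standalone sketch, not the question's training loop), `retain_graph=True` makes a second backward pass through the same graph legal, though gradients then accumulate across calls:

```python
import torch

w = torch.randn(3, requires_grad=True)
loss = (w * 2).sum()

loss.backward(retain_graph=True)  # the graph is kept alive
loss.backward()                   # second backward now succeeds

# Gradients accumulate across backward calls: each pass adds 2.0 per
# element, so w.grad is now 4.0 everywhere.
print(w.grad)
```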
In the first version, each loop iteration builds a fresh computational graph when `out = self(x)` runs. Every iteration's graph looks like:

```
out = self(x) -> loss = self.lossfn(out, y)
```
In the second version, since `out` is declared outside the loop, every iteration's graph shares a parent node through `out`:

```
          |- out[i] = self(x) -> loss = self.lossfn(out[i], y)
out ------|- out[i] = self(x) -> loss = self.lossfn(out[i], y)
          |- out[i] = self(x) -> loss = self.lossfn(out[i], y)
```
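If the preallocated `out` is only meant to record predictions, an alternative is to compute the loss on the fresh output and store a detached copy, so each iteration's graph stays independent and no `retain_graph` is needed. This is my own sketch with stand-in model, loss, and data, not code from the question:

```python
import torch
import torch.nn as nn

# Stand-ins for the question's model, loss function, optimizer, and data.
model = nn.Linear(4, 1)
lossfn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
X, Y = torch.randn(8, 4), torch.randn(8, 1)

out = torch.empty_like(Y)
for i in range(len(X)):
    optimizer.zero_grad()
    pred = model(X[i])         # fresh graph built each iteration
    out[i] = pred.detach()     # store the value only; no graph attached
    loss = lossfn(pred, Y[i])  # loss is computed on pred, not on out
    loss.backward()            # no shared parent node, so no error
    optimizer.step()

print("loop finished, out filled:", out.shape)
```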
Therefore, here's a timeline of what happens.