pytorch joblib pytorch-lightning autograd

PyTorch autograd issue with joblib

There seems to be a problem mixing PyTorch's autograd with joblib. I need to compute gradients in parallel for a large number of samples. Joblib works fine with other parts of PyTorch, but combining it with autograd produces errors. Below is a very small example in which the serial version works fine but the parallel version crashes.

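import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np

torch.autograd.set_detect_anomaly(True)
# Helper: wrap a value in a tensor that tracks gradients by default
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)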

def Grad(X, Out):
    return autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    yi = xi * xi
    xs += [xi]
    ys += [yi]


Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)

The error message is not very helpful either:

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

Solution

Joblib does not copy the computation graph associated with the operations to the worker processes. One way to work around this is to perform the computation inside the job, so that the graph is built there.

import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np
torch.autograd.set_detect_anomaly(False)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    # This will compute yi in the job, and thus will
    # create the graph here
    yi = Out[0](*Out[1])
    # now the differentiation works
    return autograd.grad(yi, X, create_graph=True, allow_unused=False)[0]

torch.set_num_threads(1)
xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    yi = (lambda xi: xi * xi, [xi])  # (function, args) pair, evaluated inside the job
    xs += [xi]
    ys += [yi]

Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
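
To see why the graph goes missing, here is a minimal sketch. It assumes that joblib's process-based workers receive their arguments through pickling, and it uses the standard pickle module as a stand-in for that transfer; the variable names are only illustrative. The round-tripped tensor keeps its value but comes back without a grad_fn, which would explain why autograd.grad cannot find X in the graph of Out.

```python
import pickle
import torch

x = torch.tensor(0.5, requires_grad=True)
y = x * x  # non-leaf tensor: it remembers how it was computed

# Stand-in for sending (x, y) to a worker process
# (assumption: joblib pickles the arguments of each job)
x_copy, y_copy = pickle.loads(pickle.dumps((x, y)))

print("original grad_fn:", y.grad_fn)
print("copy grad_fn:    ", y_copy.grad_fn)
print("values equal:    ", torch.equal(y, y_copy))
```

This is also why passing the function and its inputs, rather than the already computed output, lets the worker rebuild the graph on its own side.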

Edit

Edit #2

The existence of torch.multiprocessing suggests that gradients require some special treatment. You could try writing a joblib backend that uses torch.multiprocessing instead of multiprocessing or threading.
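
Short of writing a custom backend, one thing that may be worth trying first is joblib's built-in thread-based backend. This is a minimal sketch, not a benchmark: it assumes that worker threads share the parent's memory, so the tensors and their graph are never pickled, and it uses fixed inputs (my choice, for easy checking) instead of random ones. The helper name grad_of is only illustrative.

```python
import torch
from torch import autograd
from joblib import Parallel, delayed

def grad_of(x, out):
    # Same differentiation as Grad in the question
    return autograd.grad(out, x, create_graph=True, allow_unused=False)[0]

# Fixed inputs (illustrative) so the result is easy to check: d(x*x)/dx = 2x
xs = [torch.tensor(v, requires_grad=True) for v in (0.5, 1.0, 1.5)]
ys = [x * x for x in xs]  # graphs are built in the parent, as in the question

# backend="threading" runs the jobs in threads of the current process
grads_threaded = Parallel(n_jobs=2, backend="threading")(
    delayed(grad_of)(x, y) for x, y in zip(xs, ys)
)
print("grads_threaded", [g.item() for g in grads_threaded])
```

Whether this gives any speedup depends on how much of the work happens inside PyTorch operations that release the GIL, so treat it as a way to keep the graph intact rather than as a guaranteed performance win.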

Here is an overview of how graphs are constructed in both frameworks (TensorFlow and PyTorch):

https://www.tensorflow.org/guide/intro_to_graphs

https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/

But I fear that giving a definite answer as to why one works and the other does not would require looking into the implementation.