There seems to be a problem mixing PyTorch's autograd with joblib. I need to compute gradients in parallel for a lot of samples. Joblib works fine with other aspects of PyTorch, but when mixed with autograd it gives errors. I made a very small example in which the serial version works fine but the parallel version crashes.
import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np
torch.autograd.set_detect_anomaly(True)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)
def Grad(X, Out):
    return autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]
xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()
    yi = xi * xi
    xs += [xi]
    ys += [yi]
Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
The error message is not very helpful either:
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
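Joblib does not copy the autograd graph associated with the operations to the worker process, so in the worker the output no longer appears to depend on the input, which is consistent with the error message. One way to work around this is to perform the computation inside the job.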
import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np
torch.autograd.set_detect_anomaly(False)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)
def Grad(X, Out):
    # This will compute yi in the job, and thus will
    # create the graph here
    yi = Out[0](*Out[1])
    # now the differentiation works
    return autograd.grad(yi, X, create_graph=True, allow_unused=False)[0]
torch.set_num_threads(1)
xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()
    # Store the function and its arguments instead of the result
    yi = (lambda xi: xi * xi, [xi])
    xs += [xi]
    ys += [yi]
Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)
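To illustrate why the original version fails, here is a minimal sketch that uses a plain pickle round-trip as a stand-in for sending arguments to a worker process. That joblib's process-based backend behaves the same way is an assumption here, not something checked against its implementation; the names x_copy and y_copy are only for illustration.

```python
import pickle

import torch
from torch import autograd

x = torch.tensor(0.5, requires_grad=True)
y = x * x  # non-leaf tensor: its grad_fn links it back to x

# Stand-in for shipping the arguments to another process
# (assumption: the transfer behaves like a plain pickle round-trip).
x_copy, y_copy = pickle.loads(pickle.dumps((x, y)))

print("original grad_fn:", y.grad_fn)
print("copied grad_fn:  ", y_copy.grad_fn)

# The copies carry values only, so y_copy can no longer be
# differentiated with respect to x_copy.
try:
    autograd.grad(y_copy, x_copy)
    failed = False
except RuntimeError as err:
    failed = True
    print("RuntimeError:", err)
```

Recomputing yi inside the job, as in the workaround above, avoids this because the graph is then built in the same process that differentiates it.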
More philosophical questions are:
(1) Does it make sense to use joblib parallelism if you can simply vectorize your operations and let torch use intra-operator parallelism?
(2) mak14 mentioned using the threading backend, and it is good that it fixes your example. But because of Python's GIL, multiple threads largely run on a single CPU. That makes sense for IO-bound jobs, like making HTTP requests, but not for CPU-bound operations.
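Both points can be sketched on the toy problem from the question. This is only a sketch under assumptions: it uses y = x * x as in the example, and the names grads_vectorized, grads_threaded and grad are made up for illustration. No claim is made about which variant is faster.

```python
import torch
from torch import autograd
from joblib import Parallel, delayed

torch.manual_seed(0)

# (1) Vectorized: one tensor holds all samples and a single autograd
# call returns all gradients. Each ys[i] depends only on xs[i], so
# d(sum(ys))/d(xs[i]) equals d(ys[i])/d(xs[i]).
xs = torch.rand(10, requires_grad=True)
ys = xs * xs
grads_vectorized = autograd.grad(ys.sum(), xs)[0]


# (2) Threading backend: threads share memory with the main thread,
# so graphs built before the parallel call are still attached.
def grad(x, out):
    return autograd.grad(out, x)[0]


x_list = [xs[i].detach().clone().requires_grad_(True) for i in range(10)]
y_list = [x * x for x in x_list]
grads_threaded = torch.stack(
    Parallel(n_jobs=2, backend="threading")(
        delayed(grad)(x, y) for x, y in zip(x_list, y_list)
    )
)

print("vectorized:", grads_vectorized)
print("threaded:  ", grads_threaded)
```

For this example both variants should agree with the analytic gradient 2 * x. The vectorized form needs no joblib at all, which is the point of question (1).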
The existence of torch.multiprocessing suggests that gradients require some special treatment. You could attempt to write a joblib backend that uses torch.multiprocessing instead of multiprocessing or threading.
Here is an overview of how graphs are constructed in both frameworks:
https://www.tensorflow.org/guide/intro_to_graphs
https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/
But I fear that giving a definite answer as to why one works and the other does not would require looking into the implementation.