Tags: pytorch, joblib, pytorch-lightning, autograd

PyTorch's autograd issue with joblib


There seems to be a problem when mixing PyTorch's autograd with joblib. I need to compute gradients in parallel for a large number of samples. Joblib works fine with other aspects of PyTorch; however, when mixed with autograd it raises errors. I made a very small example below: the serial version works fine, but the parallel version crashes.

import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np

torch.autograd.set_detect_anomaly(True)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    return autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    yi = xi * xi
    xs += [xi]
    ys += [yi]


Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)

The error message is not very helpful either:

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

Solution

  • Joblib is not copying the graph associated with the operations to the worker processes. One way to work around it is to perform the computation inside the job.

    import torch
    from torch import autograd
    from joblib import Parallel, delayed
    import numpy as np
    torch.autograd.set_detect_anomaly(False)
    tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)
    
    def Grad(X, Out):
        # This will compute yi in the job, and thus will
        # create the graph here
        yi = Out[0](*Out[1])
        # now the differentiation works
        return autograd.grad(yi, X, create_graph=True, allow_unused=False)[0]
    
    torch.set_num_threads(1)
    xs, ys = [], []
    for i in range(10):
        xi = tt(np.random.rand()).float()    
        # defer the computation: ship a (function, args) pair so each
        # job can build its own graph
        yi = (lambda x: x * x, [xi])
        xs += [xi]
        ys += [yi]
    
    Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
    print("Grads_serial", Grads_serial)
    Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
    print("Grads_parallel", Grads_parallel)
    

    Edit

    More philosophical questions:

    (1) Does it make sense to use joblib parallelism if you can simply vectorize your operations and let torch use intra-operator parallelism? (A vectorized sketch follows below.)

    (2) mak14 mentioned using the threading backend, and it is good that it fixes your example. But multiple threads will use only one CPU at a time; that makes sense for I/O-bound jobs, like making HTTP requests, but not for CPU-bound operations. (A threading sketch follows below.)
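
    As a minimal sketch of the vectorized alternative from (1), assuming the same x * x operation as in the example: stacking all samples into one tensor lets a single autograd call produce every per-sample gradient, since the gradient of the sum decouples across elements.

    import torch

    # one tensor holding all 10 samples instead of 10 scalar tensors
    xs = torch.rand(10, requires_grad=True)
    ys = xs * xs

    # d(sum_i x_i^2)/dx_i = 2 * x_i, so a single call yields every
    # per-sample gradient, with torch parallelizing internally
    grads = torch.autograd.grad(ys.sum(), xs, create_graph=True)[0]
    print(grads)  # equals 2 * xs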
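
    And a sketch of the threading workaround from (2), reusing Grad, xs, and ys from the original example: threads share the parent process's memory, so the graph built in the main thread remains visible to the workers.

    # threads see the graph built in the main thread, so no copy is needed
    Grads_threaded = Parallel(n_jobs=2, backend="threading")(
        [delayed(Grad)(x, y) for x, y in zip(xs, ys)]
    )
    print("Grads_threaded", Grads_threaded)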

    Edit #2

    The existence of torch.multiprocessing suggests that gradients require some special treatment; you could attempt to write a joblib backend that uses torch.multiprocessing instead of multiprocessing or threading.
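
    Writing a full joblib backend is involved; a rough sketch of the simpler route, calling torch.multiprocessing's drop-in Pool directly, could look like the following. The graph is still built inside the worker, and a module-level function replaces the lambda because lambdas do not survive standard pickling.

    import torch
    import torch.multiprocessing as mp

    def square(x):
        return x * x

    def grad_job(x):
        # build the graph inside the worker, as in the workaround above
        y = square(x)
        return torch.autograd.grad(y, x)[0]

    if __name__ == "__main__":
        xs = [torch.rand(1, requires_grad=True) for _ in range(10)]
        with mp.Pool(2) as pool:  # same API as multiprocessing.Pool
            grads = pool.map(grad_job, xs)
        print(grads)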

    Here you can find an overview of how graphs are constructed in both frameworks:

    https://www.tensorflow.org/guide/intro_to_graphs

    https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/

    But I fear that to give a definitive answer as to why one works and the other does not, one would have to look into the implementation.