python pytorch parallel-processing distributed

How to set the random seed in distributed training in PyTorch?


I am training a model using torch.distributed, but I am not sure how to set the random seeds. This is my current code:

import numpy as np
import torch
import torch.backends.cudnn as cudnn
import torch.multiprocessing as mp

def main():
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed(args.seed)

    cudnn.enabled = True
    cudnn.benchmark = True
    cudnn.deterministic = True

    mp.spawn(main_worker, nprocs=args.ngpus, args=(args,))

And should I move the

    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed(args.seed)

    cudnn.enabled = True
    cudnn.benchmark = True
    cudnn.deterministic = True 

into the function main_worker() to make sure every process gets the correct seed and cudnn settings? By the way, I have tried this, and it makes training about 2 times slower, which really confuses me.

Thank you very much for any help!


Solution

  • The spawned child processes do not inherit the seed you set manually in the parent process, so you need to set the seed inside the main_worker function.

    The same logic applies to cudnn.benchmark and cudnn.deterministic, so if you want to use these, you have to set them in main_worker as well. If you want to verify that, you can just print their values in each process.

    cudnn.benchmark = True makes cuDNN benchmark multiple implementations of certain operations (e.g. the available convolution algorithms) to find the fastest one for your model. The benchmark itself takes time, but once it is done, subsequent iterations are potentially faster. However, the algorithm that was selected only applies to the specific input size it was benchmarked with. If the next iteration uses a different input size, the benchmark needs to run again to determine the best algorithm for that size, which might be a different one than for the first input size.

    I'm assuming that your input sizes vary, which would explain the slowdown, as the benchmark results were never actually used when benchmarking was only enabled in the parent process. cudnn.benchmark = True should only be used if your input sizes are fixed.

    cudnn.deterministic = True may also have a negative impact on performance, because certain non-deterministic operations need to be replaced with deterministic versions, which tend to be slower (otherwise the deterministic versions would have been used in the first place). That performance impact shouldn't be too dramatic, though.
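    Putting that together, here is a minimal sketch of the restructuring. set_seed is an illustrative helper name, not a PyTorch API, and the torch/cudnn lines are shown as comments so the sketch stays runnable without a GPU:

    ```python
    import random

    import numpy as np


    def set_seed(seed):
        """Seed every RNG this process uses (illustrative helper, not a PyTorch API)."""
        random.seed(seed)
        np.random.seed(seed)
        # In the real worker you would seed torch here as well:
        # torch.manual_seed(seed)
        # torch.cuda.manual_seed(seed)
        # cudnn.enabled = True
        # cudnn.benchmark = True       # only if your input sizes are fixed
        # cudnn.deterministic = True


    def main_worker(gpu, args):
        # Every spawned process seeds itself -- seeds and cudnn settings
        # made in the parent before mp.spawn() are not inherited here.
        set_seed(args.seed)
        # ... build the model and run the training loop ...


    # main() then only needs to spawn the workers:
    # def main():
    #     mp.spawn(main_worker, nprocs=args.ngpus, args=(args,))
    ```

    Calling set_seed at the top of main_worker guarantees each process starts from a known RNG state, which is exactly what seeding in the parent fails to do with the spawn start method.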