Tags: deep-learning, pytorch, distributed-computing

What does local rank mean in distributed deep learning?


https://github.com/huggingface/transformers/blob/master/examples/run_glue.py

I want to adapt this script to do text classification on my own data. The machine for this task is a single computer with two graphics cards. So this involves a kind of "distributed" training via the local_rank argument used in the script above, especially where it checks whether local_rank equals 0 or -1, as in line 83.

After reading some material on distributed computation, my guess is that local_rank is like an ID for a machine, and that 0 means this machine is the "main" or "head" of the computation. But what is -1?


Solution

  • Q: But what is -1?

    Usually, this is used to disable the distributed setting. Note that local_rank identifies a process's rank on the local machine (one process per GPU under torch.distributed.launch), not the machine itself; the sentinel value -1 means no distributed training. Indeed, as you can see here:

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    

    and here:

    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                          output_device=args.local_rank,
                                                          find_unused_parameters=True)
    

    setting local_rank to -1 selects the plain RandomSampler and skips the DistributedDataParallel wrapper, i.e. the script runs in ordinary non-distributed mode.
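
    To make the convention concrete, here is a minimal sketch (not taken verbatim from run_glue.py) of how such scripts typically consume the flag: torch.distributed.launch spawns one worker process per GPU and passes --local_rank to each, so the default of -1 signals an ordinary single-process run. The torch.cuda.set_device and torch.distributed.init_process_group calls are the standard PyTorch APIs for this pattern; the variable name device is just illustrative.

    import argparse

    import torch

    parser = argparse.ArgumentParser()
    # torch.distributed.launch fills this in for each spawned worker;
    # the default of -1 means "not launched in distributed mode".
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()

    if args.local_rank == -1:
        # Non-distributed: a single process drives training on one device.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    else:
        # Distributed: each process owns the GPU matching its local rank
        # and joins the process group set up via the launcher's env vars.
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        torch.distributed.init_process_group(backend="nccl")

    On a single machine with two GPUs, the distributed path would be enabled with something like python -m torch.distributed.launch --nproc_per_node=2 run_glue.py ..., which starts two processes that receive local_rank 0 and 1 respectively.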