I have a model that trains just fine on a single GPU, but I get CUDA memory errors when I switch to PyTorch DistributedDataParallel (DDP). Specifically, the DDP model takes up twice the memory of the same model without parallelism. Here is a minimal reproducible example:
import os
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.multiprocessing as mp
import torch


def train(rank, gpu_list, train_distributed):
    device_id = gpu_list[rank]
    model = torch.nn.Linear(1000, 1000)
    # memory allocated before the model is moved to the GPU
    print(device_id, torch.cuda.memory_allocated(device_id))
    model.to(device_id)
    # memory allocated after the model is moved to the GPU
    print(device_id, torch.cuda.memory_allocated(device_id))
    print(device_id, torch.cuda.memory_allocated(device_id))
    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
    # memory allocated after the (optional) DDP wrapping
    print(device_id, torch.cuda.memory_allocated(device_id))


def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'
    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)


if __name__ == '__main__':
    # First test one GPU
    train(0, [torch.device(5)], False)
    # Then test multiple GPUs
    train_distributed()
Output (note that the allocated GPU memory doubles on both devices when switching to DDP):
cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704
Why does the model take up twice the space in DDP? Is it intended behavior? Is there a way to avoid this extra memory usage?
I'm adding the solution that @ptrblck posted on the PyTorch discussion forum.
Here are two quotes.
The statement:
[...] the allocated memory get doubled when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel
And the answer:
[...] the Reducer will create gradient buckets for each parameter, so that the memory usage after wrapping the model into DDP will be 2x model_parameter_size. Note that the parameter size of a model is often much smaller than the activation size so that this memory increase might or might not be significant
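So the doubling comes from the gradient buckets that the Reducer allocates, which add roughly one more copy of the parameter size on each device. This is expected behavior, and it only matters when the parameters make up a large share of the total memory footprint.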
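As a rough sanity check, the byte counts printed in the question can be reproduced with plain arithmetic. The sketch below assumes float32 parameters and assumes that the CUDA caching allocator rounds each allocation up to a multiple of 512 bytes. The `BLOCK` constant and the helper names are illustrative assumptions, not something stated in the quoted answer.

```python
# Parameter shapes of torch.nn.Linear(1000, 1000): weight and bias.
PARAM_SHAPES = [(1000, 1000), (1000,)]
BYTES_PER_FLOAT32 = 4
BLOCK = 512  # assumed allocator rounding granularity, in bytes


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def rounded_bytes(shape):
    """Raw float32 size of the tensor, rounded up to a multiple of BLOCK."""
    raw = numel(shape) * BYTES_PER_FLOAT32
    return -(-raw // BLOCK) * BLOCK  # ceiling division, then scale back up


param_bytes = sum(rounded_bytes(shape) for shape in PARAM_SHAPES)
print(param_bytes)      # 4004352, the value printed after model.to(device_id)
print(2 * param_bytes)  # 8008704, the value printed after wrapping in DDP
```

Under these assumptions, the parameters account for 4004352 bytes, and the figure reported after the DDP wrap is exactly twice that, which is consistent with the "2x model_parameter_size" explanation quoted above.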