Model takes twice the memory footprint with distributed data parallel

I have a model that trains just fine on a single GPU, but I get CUDA memory errors when I switch to PyTorch DistributedDataParallel (DDP). Specifically, the DDP-wrapped model takes up twice the memory of the same model without parallelism. Here is a minimal reproducible example:

import os
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import torch.multiprocessing as mp
import torch

def train(rank, gpu_list, train_distributed):
    
    device_id = gpu_list[rank]

    model = torch.nn.Linear(1000, 1000)
    print(device_id, torch.cuda.memory_allocated(device_id))
    model.to(device_id)
    print(device_id, torch.cuda.memory_allocated(device_id))

    print(device_id, torch.cuda.memory_allocated(device_id))
    if train_distributed:
        # convert model to DDP
        dist.init_process_group("gloo", rank=rank, world_size=len(gpu_list))
        model = DDP(model, device_ids=[device_id], find_unused_parameters=False)
    print(device_id, torch.cuda.memory_allocated(device_id))

def train_distributed():
    gpu_list = [torch.device(i) for i in [5, 6]]
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '7676'
    mp.spawn(train, args=(gpu_list, True), nprocs=len(gpu_list), join=True)

if __name__ == '__main__':
    # First test one GPU
    train(0, [torch.device(5)], False)

    # Then test multiple GPUs
    train_distributed()

Output - note that the GPU usage doubles on both devices when switching to DDP:

cuda:5 0
cuda:5 4004352
cuda:5 4004352
cuda:5 4004352
cuda:5 0
cuda:6 0
cuda:5 4004352
cuda:5 4004352
cuda:6 4004352
cuda:6 4004352
cuda:5 8008704
cuda:6 8008704

Why does the model take up twice the space in DDP? Is it intended behavior? Is there a way to avoid this extra memory usage?

Solution

The explanation below comes from @ptrblck's reply on the PyTorch discussion forum, quoted in two parts.

The statement:

[...] the allocated memory get doubled when torch.distributed.Reducer is instantiated in the constructor of DistributedDataParallel

And the answer:

[...] the Reducer will create gradient buckets for each parameter, so that the memory usage after wrapping the model into DDP will be 2x model_parameter_size. Note that the parameter size of a model is often much smaller than the activation size so that this memory increase might or might not be significant

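So the doubling is expected behavior: the Reducer's gradient buckets take as much memory as the parameters themselves, which is why memory_allocated goes from 4004352 to 8008704 bytes once the model is wrapped in DDP. Because this overhead scales with the parameter size rather than the activation size, it is often a small part of the total memory used during training.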
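To check the quoted explanation against the numbers in the question, here is a rough back-of-the-envelope sketch in plain Python (no GPU needed). It rests on two assumptions that are not stated in the quotes: that the CUDA caching allocator rounds each allocation up to a multiple of 512 bytes, and that the bucket storage is the same size as the rounded parameter storage. The helper names `allocated_bytes` and `numel` are made up for this illustration.

```python
# Assumption: the CUDA caching allocator rounds every allocation
# up to a multiple of 512 bytes.
BLOCK_BYTES = 512
FLOAT32_BYTES = 4


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def allocated_bytes(num_elements, itemsize=FLOAT32_BYTES, block=BLOCK_BYTES):
    """Bytes reserved for one tensor after rounding up to the block size."""
    raw = num_elements * itemsize
    return -(-raw // block) * block  # ceiling division, then scale back up


# torch.nn.Linear(1000, 1000) holds a 1000x1000 weight and a 1000-element bias.
param_shapes = {"weight": (1000, 1000), "bias": (1000,)}

param_bytes = sum(allocated_bytes(numel(s)) for s in param_shapes.values())

# Assumption: the Reducer's gradient buckets occupy as much memory as the
# parameters, which is the 2x figure from the quoted answer.
ddp_bytes = 2 * param_bytes

print(param_bytes)  # 4004352
print(ddp_bytes)    # 8008704
```

Both values match the output in the question: 4004352 bytes after model.to(device_id) and 8008704 bytes after wrapping in DDP. For a model whose activations dominate memory, the same calculation would show the bucket overhead to be a much smaller fraction of the total.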