pytorch, pytorch-lightning, deepspeed

DeepSpeed Lightning refusing to parallelize layers even when set to stage 3


I want to come up with a very simple Lightning example using DeepSpeed, but it refuses to parallelize layers even when I set the strategy to stage 3.

I'm just blowing up the model by adding FC layers in the hope that they get distributed across the different GPUs (6 in total).

But I'm ending up with:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 3; 15.00 GiB total capacity; 14.00 GiB already allocated; 5.25 MiB free; 14.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Therefore I guess the layers are only placed on a single GPU.
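To sanity-check that, something like this rough per-GPU memory printout (plain torch.cuda utilities; each process only reports its own allocations) should show where the parameters actually end up:

import torch

# Rough sketch: print the memory PyTorch has allocated on each visible GPU
# to see whether the model is sharded across devices or sits on one of them.
for i in range(torch.cuda.device_count()):
    allocated_gib = torch.cuda.memory_allocated(i) / 2**30
    print(f"GPU {i}: {allocated_gib:.2f} GiB allocated")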

The full code is available here, but this is the gist of it:

Blowing up the model with 18000 layers:

import torch
import torch.nn as nn
import lightning as L

n_layers = 18000  # blow up the model with 18000 FC layers

class TelModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        embed_dim = 512
        component_list = [
                nn.Linear(512, embed_dim)
        #] + [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(n_layers)] + [
        ] + [nn.Linear(embed_dim, 512) for _ in range(n_layers)] + [
                nn.Linear(embed_dim, 512)
        ]
        self.net = torch.nn.Sequential(*component_list)
        # training_step / configure_optimizers omitted here; see the full code

Initializing the Trainer with the DeepSpeed stage 3 strategy:

from torch.utils.data import DataLoader

BATCH_SIZE = 256  # per-device batch size

tel_model = TelModel()
train_ds = RandomDataset(100)  # simple synthetic dataset defined in the full code
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE)
trainer = L.Trainer(accelerator="gpu", devices=6, strategy="deepspeed_stage_3", precision=32)
trainer.fit(tel_model, train_loader)
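As far as I understand, the "deepspeed_stage_3" string is shorthand for an explicit strategy object; a sketch of the equivalent configuration, assuming the lightning.pytorch.strategies import path:

from lightning.pytorch.strategies import DeepSpeedStrategy

# Equivalent, more explicit way to request ZeRO stage 3 sharding
trainer = L.Trainer(
    accelerator="gpu",
    devices=6,
    strategy=DeepSpeedStrategy(stage=3),
    precision=32,
)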

And finally, I run it like this:

deepspeed lightning-deepspeed-tel.py


Solution

  • The batch size you pass to the DataLoader is the batch size per device. The CUDA OOM error is most likely because a per-device batch size of 256 is too big. Trying a smaller batch size such as 32 or 64 should solve the issue (see the sketch below). The effective batch size of your run is batch_size_per_device x num_gpus_per_node x num_nodes.
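A minimal sketch of the suggested fix, reusing the variable names from the question (only the per-device batch size changes):

BATCH_SIZE = 32  # try 32 or 64 instead of 256
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE)

# With 1 node and 6 GPUs this gives an effective batch size of
# 32 (per device) x 6 (GPUs per node) x 1 (node) = 192 samples per step.
trainer.fit(tel_model, train_loader)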