How to train a model with multiple GPUs in PyTorch?

My server has two GPUs. How can I use both of them for training at the same time to maximize their computing power? Is my code below correct? Does it allow my model to be trained properly?

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # pretrained_model is defined elsewhere (not shown)
        self.bert = pretrained_model
        # in_features (2048) must equal the encoder's hidden size
        self.linear = nn.Linear(2048, 4)

    def forward(self, input_ids, attention_mask):
        # Pass the mask by keyword so it cannot land in the wrong positional slot
        output = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # last_hidden_state has shape (batch_size, seq_len, hidden_size)
        output = output[:, -1, :]  # keep the last token: (batch_size, hidden_size)
        output = self.linear(output)
        return output

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Build the model unconditionally so it also exists on a single-GPU or CPU machine
model = MyModel()
if torch.cuda.device_count() > 1:
    print("Use", torch.cuda.device_count(), "gpus")
    model = nn.DataParallel(model)
model = model.to(device)

Solution

There are two different ways to train on multiple GPUs:

Data Parallelism = splitting a large batch that can't fit into a single GPU's memory across multiple GPUs, so that each GPU processes a smaller batch that fits into its own memory.
Model Parallelism = splitting the layers of the model across different devices. This is a bit trickier to manage.
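To make the difference concrete, here is a minimal sketch of both approaches. It is only an illustration: the layer sizes, the `pick_devices` helper, and the `TwoStageNet` class are made up for this example, and everything falls back to a single device (or the CPU) when two GPUs are not available.

```python
import torch
import torch.nn as nn


def pick_devices():
    """Hypothetical helper: return two devices, or the same one twice if <2 GPUs."""
    if torch.cuda.device_count() >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    return dev, dev


class TwoStageNet(nn.Module):
    """Model parallelism: each stage of the network lives on its own device."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage1 = nn.Linear(16, 32).to(dev0)
        self.stage2 = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        # Activations are moved from one device to the next between stages.
        h = torch.relu(self.stage1(x.to(self.dev0)))
        return self.stage2(h.to(self.dev1))


dev0, dev1 = pick_devices()
batch = torch.randn(8, 16)

# Data parallelism: the whole model is replicated on every GPU and the
# batch is split along dimension 0, one slice per replica.
dp_model = nn.Linear(16, 4).to(dev0)
if torch.cuda.device_count() > 1:
    dp_model = nn.DataParallel(dp_model)
dp_out = dp_model(batch.to(dev0))

# Model parallelism: one copy of the model, with its layers spread over devices.
mp_model = TwoStageNet(dev0, dev1)
mp_out = mp_model(batch)

print(dp_out.shape, mp_out.shape)
```

In both cases the output has one row per input sample; what changes is whether the batch or the model is divided between the GPUs.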

Please refer to this post for more information

To do Data Parallelism in pure PyTorch, please refer to this example, which I created a while back and updated to the latest changes of PyTorch (as of today, 1.12).
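For readers who cannot open the linked example, below is a rough sketch of what data-parallel training with `DistributedDataParallel` can look like. It is not the linked example. The toy dataset, the `nn.Linear` model, the port number, and the `train` and `launch` function names are all assumptions chosen for illustration. It uses the `nccl` backend when enough GPUs are present and falls back to `gloo` on the CPU otherwise.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(rank, world_size, epochs=2):
    """Run one training process; `rank` is this process's index."""
    # Address and port of the rendezvous point (assumed values).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available() and torch.cuda.device_count() >= world_size
    dist.init_process_group("nccl" if use_cuda else "gloo",
                            rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Toy data standing in for a real dataset (illustrative only).
    torch.manual_seed(0)
    dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
    # The sampler hands each process a distinct shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    model = nn.Linear(16, 4).to(device)
    # DDP averages gradients across processes during backward().
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    losses = []
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
            losses.append(loss.item())

    dist.destroy_process_group()
    return losses


def launch():
    """Start one process per GPU (or a single CPU process if there is no GPU)."""
    world_size = max(torch.cuda.device_count(), 1)
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```

In a script, `launch()` would be called from inside an `if __name__ == "__main__":` guard, since `mp.spawn` starts fresh Python processes. Unlike `nn.DataParallel`, which drives all GPUs from a single process, this approach runs one process per GPU, and the PyTorch documentation recommends it over `nn.DataParallel` even on a single machine.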

To do multi-GPU training with another library and without much engineering effort, I would suggest PyTorch Lightning, as it has a straightforward API and good documentation on multi-GPU training using Data Parallelism.

Update: 2022/10/25

Here is a video that explains the different types of distributed training in much more detail: https://youtu.be/BPYOsDCZbno?t=1011