python pytorch gradient-descent

Does the SGD optimizer in PyTorch actually perform gradient descent?


I'm trying to compare the convergence rates of the SGD and GD algorithms for neural networks. In PyTorch, we often use the SGD optimizer as follows.

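import torch

# model, train_dataset, loss (the criterion), and epochs are assumed to be defined elsewhere.
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
for epoch in range(epochs):
    running_loss = 0.0

    for input_batch, labels_batch in train_dataloader:
        y_hat = model(input_batch)      # forward pass on one mini-batch
        L = loss(y_hat, labels_batch)   # loss on that mini-batch

        optimizer.zero_grad()           # clear gradients left over from the previous step
        L.backward()                    # backpropagate to populate .grad on each parameter
        optimizer.step()                # update parameters using the current gradients

        running_loss += L.item()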

My understanding is that the SGD optimizer here actually performs mini-batch gradient descent, because each call to optimizer.step() uses the gradient computed from a single batch of data. So, if we set the batch_size parameter to the size of the whole dataset, the same code performs (full-batch) gradient descent on the neural network.
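
To make the full-batch case concrete, here is a minimal sketch of the data loader I have in mind. The toy TensorDataset and its sizes (100 samples, 3 features) are hypothetical stand-ins for train_dataset, not my real data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for train_dataset (hypothetical: 100 samples, 3 features).
train_dataset = TensorDataset(torch.randn(100, 3), torch.randn(100, 1))

# batch_size equal to the dataset size, so each epoch yields a single batch
# that contains every sample.
full_batch_loader = DataLoader(
    train_dataset, batch_size=len(train_dataset), shuffle=True
)

num_batches = len(full_batch_loader)
first_inputs, first_labels = next(iter(full_batch_loader))
print(num_batches, tuple(first_inputs.shape))
```

With this loader the inner training loop runs once per epoch, so optimizer.step() is called once per pass over the data.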

Is my understanding correct?


Solution

  • Your understanding is correct. torch.optim.SGD simply updates the weights using whatever gradient backpropagation has stored on them; it has no knowledge of how many samples produced that gradient. The flavor of gradient descent that it performs is therefore determined by the data loader.

    • Gradient descent (aka batch gradient descent): batch size equal to the size of the entire training dataset.
    • Stochastic gradient descent: batch size equal to one and shuffle=True.
    • Mini-batch gradient descent: any batch size strictly between one and the dataset size, with shuffle=True. By far the most common in practical applications.
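
The three cases above can be sketched as follows. This is only an illustration under stated assumptions: the toy data, the Linear model, MSELoss, lr=0.1, and the batch size of 8 are all hypothetical choices, and the check relies on torch.optim.SGD being used with its defaults (no momentum, no weight decay). It builds one data loader per flavor, then takes a single full-batch step and compares it with the hand-written update p - lr * grad.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy regression data and model (hypothetical stand-ins for the real ones).
X, y = torch.randn(32, 3), torch.randn(32, 1)
dataset = TensorDataset(X, y)
model = torch.nn.Linear(3, 1)
loss_fn = torch.nn.MSELoss()
lr = 0.1

# The three flavors differ only in how the DataLoader is configured.
loaders = {
    "batch GD": DataLoader(dataset, batch_size=len(dataset)),
    "stochastic GD": DataLoader(dataset, batch_size=1, shuffle=True),
    "mini-batch GD": DataLoader(dataset, batch_size=8, shuffle=True),
}
steps_per_epoch = {name: len(loader) for name, loader in loaders.items()}

# Take one optimizer step on the full batch.
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
inputs, targets = next(iter(loaders["batch GD"]))
before = [p.detach().clone() for p in model.parameters()]

optimizer.zero_grad()
loss_fn(model(inputs), targets).backward()
grads = [p.grad.detach().clone() for p in model.parameters()]
optimizer.step()

# Compare with a hand-written gradient descent update: p_new = p_old - lr * grad.
matches = all(
    torch.allclose(p.detach(), b - lr * g)
    for p, b, g in zip(model.parameters(), before, grads)
)
print(steps_per_epoch, matches)
```

With 32 samples, the loaders take 1, 32, and 4 optimizer steps per epoch respectively, and the full-batch step agrees with the manual update. The optimizer code is identical in all three cases; only the batch that produced the gradient changes.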