I'm trying to compare the convergence rates of the SGD and GD algorithms for neural networks. In PyTorch, we typically use the SGD optimizer as follows.
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(epochs):
    running_loss = 0
    for input_batch, labels_batch in train_dataloader:
        y_hat = model(input_batch)       # forward pass on the current mini-batch
        L = loss(y_hat, labels_batch)    # loss for this mini-batch
        optimizer.zero_grad()            # clear gradients from the previous step
        L.backward()                     # backpropagation
        optimizer.step()                 # one parameter update per mini-batch
        running_loss += L.item()
My understanding is that the SGD optimizer here actually performs the mini-batch gradient descent algorithm, because we feed the optimizer one batch of data at a time. So, if we set the batch_size parameter to the size of the whole dataset, the code actually performs (full-batch) gradient descent on the neural network.
Is my understanding correct?
Your understanding is correct. SGD just updates the weights based on the gradient computed by backpropagation, so the flavor of gradient descent it performs is determined entirely by the data loader:

- Batch gradient descent: batch_size equal to the size of the entire training set.
- Stochastic gradient descent: batch_size equal to 1, with shuffle=True.
- Mini-batch gradient descent: any other batch_size, with shuffle=True. By far the most common in practical applications.
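The comparison you describe then comes down to how the DataLoader is configured. As a minimal sketch (reusing train_dataset from your question, so that name is assumed; len(train_dataset) requires a map-style dataset that implements __len__):

import torch

# Full-batch gradient descent: one optimizer.step() per epoch, using the entire dataset.
gd_loader = torch.utils.data.DataLoader(train_dataset, batch_size=len(train_dataset), shuffle=False)

# "True" stochastic gradient descent: one optimizer.step() per individual sample.
sgd_loader = torch.utils.data.DataLoader(train_dataset, batch_size=1, shuffle=True)

# Mini-batch gradient descent: one optimizer.step() per batch of, e.g., 64 samples.
minibatch_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

The training loop itself stays exactly as in your question; optimizer.step() simply sees a gradient averaged over the whole dataset, a single sample, or a mini-batch, respectively.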