Tags: python, optimization, deep-learning, pytorch

The running order of optimizers impacts predictions in PyTorch


I am running a linear regression with several optimizers. I noticed that if SGD runs first, the remaining optimizers reach good accuracy; otherwise, Adam and RMSprop produce terrible fits.

import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn import datasets

# Generate synthetic data
X_numpy, y_numpy = datasets.make_regression(n_samples=100, n_features=1, noise=20, random_state=15)

X = torch.from_numpy(X_numpy.astype(np.float32))
y = torch.from_numpy(y_numpy.astype(np.float32))
y = y.view(y.shape[0], 1)

# Define the model
n_samples, n_features = X.shape
input_size = n_features
output_size = 1
model = nn.Linear(input_size, output_size)

# Define learning rate
learning_rate = 0.01

# Define criteria
criterion = nn.MSELoss()

# Define different optimizers
optimizers = {    
    "Adam": torch.optim.Adam(model.parameters(), lr=learning_rate),
    "RMSprop": torch.optim.RMSprop(model.parameters(), lr=learning_rate),
    "SGD": torch.optim.SGD(model.parameters(), lr=learning_rate),
}

# Training loop for each optimizer
num_epochs = 100
predictions = {}
for optimizer_name, optimizer in optimizers.items():
    print(f"Optimizer: {optimizer_name}")
    predictions[optimizer_name] = []
    for epoch in range(num_epochs):        
        y_predicted = model(X)
        loss = criterion(y_predicted, y)
        loss.backward()
        optimizer.step()       # update weights
        optimizer.zero_grad()  # zero the gradients
    predictions[optimizer_name] = model(X).detach().numpy()

# Plotting predictions with different colors
plt.figure(figsize=(10, 6))
plt.plot(X_numpy, y_numpy, 'ro', label='Original Data')
for optimizer_name, prediction in predictions.items():
    plt.plot(X_numpy, prediction, label=optimizer_name)
plt.legend()
plt.show()

The above code generates the following predictions:

[plot: Adam and RMSprop predictions far from the data points; only SGD fits the data]

If I run SGD first, the following happens:

optimizers = {
    "SGD": torch.optim.SGD(model.parameters(), lr=learning_rate),
    "Adam": torch.optim.Adam(model.parameters(), lr=learning_rate),
    "RMSprop": torch.optim.RMSprop(model.parameters(), lr=learning_rate),
}

[plot: all three optimizers' predictions fit the data closely]

Why does it happen?


Solution

  • The issue is that you keep the same model for all regressions, so when the first optimization ends, the next one starts from an already-trained model (the check below makes this visible). On top of that, the single learning rate only works for SGD (it seems too large for the other two optimizers), so to summarize both cases:

    • if you start with the other two, the model will not train well and will not fit the points; on the third training (SGD), the model is trained properly and fits the points.

    • if you start with SGD, it trains the model, and the two subsequent trainings barely change the weights, leading to similarly good fits.
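    A quick way to see the shared state (a hypothetical check, not part of the original code) is to print the weights before each optimizer's loop; after the first loop they are no longer the initial values:

    for optimizer_name, optimizer in optimizers.items():
        # the same nn.Linear instance is reused, so these are whatever
        # values the previous optimizer left behind
        print(optimizer_name, model.weight.item(), model.bias.item())
        # ... training loop from the question ...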

    Instead, you should reset your model before each run and define optimizer-specific learning rates:

    optimizers = dict(
        RMSprop=(torch.optim.RMSprop, 0.3),
        SGD=(torch.optim.SGD, 0.01),
        Adam=(torch.optim.Adam, 0.5),
    )

    for optimizer_name, (klass, lr) in optimizers.items():
        model = copy.deepcopy(model_)  # model_ holds the untrained initial weights
        optimizer = klass(model.parameters(), lr=lr)
        ## Proceed with your training loop
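    Filling in the loop, here is a minimal end-to-end sketch (it reuses X, y, criterion and num_epochs from the question; model_ is simply a freshly initialized linear model created before any training):

    import copy

    model_ = nn.Linear(input_size, output_size)  # untrained reference model

    predictions = {}
    for optimizer_name, (klass, lr) in optimizers.items():
        model = copy.deepcopy(model_)  # identical starting weights every run
        optimizer = klass(model.parameters(), lr=lr)
        for epoch in range(num_epochs):
            loss = criterion(model(X), y)  # forward pass + MSE loss
            optimizer.zero_grad()          # clear stale gradients
            loss.backward()                # backpropagate
            optimizer.step()               # update weights
        predictions[optimizer_name] = model(X).detach().numpy()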
    

    With the above setup, you will get the following fit regardless of the order of execution.

    [plot: all three optimizers' regression lines fit the data closely, regardless of run order]
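    As an aside (an alternative not given in the original answer), snapshotting the initial state_dict and restoring it before each run achieves the same reset without deep-copying the whole module:

    initial_state = copy.deepcopy(model_.state_dict())  # snapshot of untrained weights

    for optimizer_name, (klass, lr) in optimizers.items():
        model_.load_state_dict(initial_state)  # restore the identical starting point
        optimizer = klass(model_.parameters(), lr=lr)
        # ... training loop as above ...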