Tags: python, machine-learning, linear-regression, gradient-descent, stochastic-gradient

Unexpected output with stochastic gradient descent algorithm for linear regression


I got unexpected output while implementing the SGD algorithm for my ML homework.

This is part of my training data, which has 320 rows in total: [screenshot of the first rows]

My dataset: https://github.com/Jangrae/csv/blob/master/carseats.csv

I first did some data preprocessing:

import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load the training split and map the Yes/No values to 1/0
train_data = pd.read_csv('carseats_train.csv')
train_data.replace({'Yes': 1, 'No': 0}, inplace=True)

# One-hot encode the categorical ShelveLoc column
onehot_tr = pd.get_dummies(train_data['ShelveLoc'], dtype=int, prefix_sep='_', prefix='ShelveLoc')
train_data = train_data.drop('ShelveLoc', axis=1)
train_data = train_data.join(onehot_tr)

# Target (Sales) and feature matrix
train_data_Y = train_data.iloc[:, 0]
train_data_X = train_data.drop('Sales', axis=1)

Then I implemented the algorithm like this:

learning_rate = 0.01
epoch_num = 50
initial_w = 0.1
intercept = 0.1
w_matrix = np.ones((12, 1)) * initial_w

for e in range(epoch_num):
    for i in range(len(train_data_X)):
        # i-th training sample
        x_i = train_data_X.iloc[i].to_numpy()
        y_i = train_data_Y.iloc[i]

        # prediction for this sample
        y_estimated = np.dot(x_i, w_matrix) + intercept

        # update terms computed from the prediction error
        grad_w = x_i.reshape(-1, 1) * (y_i - y_estimated)
        grad_intercept = (y_i - y_estimated)

        # parameter update
        w_matrix = w_matrix - 2 * learning_rate * grad_w
        intercept = intercept - 2 * learning_rate * grad_intercept

print("Final weights:\n", w_matrix)
print("Final intercept:", intercept)

But the output was

Final weights:
 [[nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]]
Final intercept: [nan]

I ran it with various learning rates and also tried a convergence threshold, but I still got the same result. I couldn't figure out why my code gives me NaNs.

Can anybody see the issue?


Solution

  • You get a numeric overflow in your code: with this setting the gradients grow so large that the updates diverge and eventually turn into NaN. Consider running more epochs with a much lower learning rate (a.k.a. "step size") to make your algorithm converge. I was able to get results with a learning rate of 0.000001, but you will have to find the "correct" value for your training set and monitor the convergence (depending on the number of epochs). You could also consider an adaptive learning rate schedule. A minimal sketch along these lines is included at the end of this answer.

    On another note: I am not sure your equations are correct. Since you use (y_i - y_estimated) and not the other way around, you probably need to update your weights and intercept with + (a "double minus", if you will). Maybe you can check that again.
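    To spell out the sign: for a single sample the squared-error loss and its gradients are

        L_i = (y_i - \hat{y}_i)^2,  \qquad  \hat{y}_i = x_i^\top w + b
        \partial L_i / \partial w = -2 \, x_i (y_i - \hat{y}_i)
        \partial L_i / \partial b = -2 \, (y_i - \hat{y}_i)

    so you either keep the -2 inside the gradient and subtract it in the update, or keep x_i * (y_i - y_estimated) as in your code and add it; dropping the minus sign in both places makes every step move uphill, away from the minimum.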

    PS: Your algorithm is not yet "stochastic": you sweep through the samples in a fixed order instead of drawing them at random. ;D
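    For illustration, here is a minimal sketch of the whole training loop with these suggestions applied: a much smaller learning rate, standardized features (using the StandardScaler you already import but never use), one randomly drawn sample per update, and the gradient sign from the equations above. The file and column names are taken from your question; the learning rate and epoch count are placeholders you will still have to tune.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same preprocessing as in the question
train_data = pd.read_csv('carseats_train.csv')
train_data.replace({'Yes': 1, 'No': 0}, inplace=True)
onehot_tr = pd.get_dummies(train_data['ShelveLoc'], dtype=int, prefix='ShelveLoc')
train_data = train_data.drop('ShelveLoc', axis=1).join(onehot_tr)

y = train_data['Sales'].to_numpy(dtype=float)
X = train_data.drop('Sales', axis=1).to_numpy(dtype=float)

# Standardizing the features keeps the gradients on a comparable scale,
# which makes a single learning rate much easier to choose
X = StandardScaler().fit_transform(X)

rng = np.random.default_rng(0)
learning_rate = 1e-3   # placeholder: tune for your data
epoch_num = 200        # placeholder: watch the loss to decide when to stop
w = np.full(X.shape[1], 0.1)
intercept = 0.1

for e in range(epoch_num):
    for _ in range(len(X)):
        i = rng.integers(len(X))       # random sample: this is what makes it "stochastic"
        y_estimated = X[i] @ w + intercept
        error = y[i] - y_estimated
        grad_w = -2 * error * X[i]     # d/dw of (y_i - y_estimated)^2
        grad_intercept = -2 * error    # d/db of (y_i - y_estimated)^2
        w -= learning_rate * grad_w    # step against the gradient
        intercept -= learning_rate * grad_intercept
    # optional: compute the training MSE once per epoch to monitor convergence
    # mse = np.mean((X @ w + intercept - y) ** 2)

print("Final weights:\n", w)
print("Final intercept:", intercept)

    With standardized features a learning rate around 0.001 should converge instead of blowing up; without standardization you are back to the very small values mentioned above.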