I have been trying to implement a linear regression model with gradient descent from scratch. I have reached a point where increasing the number of iterations my code runs for leads to an obviously worse fit that nevertheless reports a lower cost. Kindly help me.
Here is the data-set I am using to train the model: https://www.kaggle.com/datasets/andonians/random-linear-regression (Train.csv)
Importing Python Libraries:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #data visualisation (seaborn)
import matplotlib.pyplot as plt
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
I am importing the data as follows:
data = pd.read_csv("../input/random-linear-regression/train.csv")
data.head()
Here is the Linear Regression class:
class LinearRegression():
    def __init__(self, iterations=100, learningrate=0.0001):
        self.iterations = iterations
        self.learningrate = learningrate

    def gradientDescent(self, x, y, threshold=1e-9):
        x = (x - x.mean()) / x.std()  #Not sure if this is needed or not
        y = (y - y.mean()) / y.std()  #Not sure if needed or not
        #adding col of 1s to accommodate bias (theta0)
        x = np.c_[np.ones(x.shape[0]), x]  # m x n matrix
        #Cleaning data to get rid of NAN and replacing with 0
        y = np.nan_to_num(y)
        #m = number of samples
        #n = number of features
        m = x.shape[0]
        n = x.shape[1]
        #weights
        theta = np.ones(n)  # n x 1 vector
        past_thetas = [0]
        past_costs = []
        for i in range(self.iterations):
            #hypothesis
            h = np.dot(x, theta)  # m x 1 vector
            error = np.subtract(h, y)  # m x 1
            errorsq = np.square(error)
            cost = 1/(2 * m) * np.sum(errorsq)
            past_costs.append(cost)
            delta = (1/m) * np.dot(x.T, error)
            theta = theta - (self.learningrate * delta)
            past_thetas.append(theta)
        print("Weights:", theta)
        return past_thetas, past_costs
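For reference, the quantities the loop is meant to compute are the usual least-squares cost and batch gradient descent update,

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)\top}\theta - y^{(i)}\right)^{2}, \qquad \theta \leftarrow \theta - \alpha \cdot \frac{1}{m} X^{\top}(X\theta - y),$$

where $\alpha$ is the learning rate and $X$ is the m x n matrix with the column of ones prepended.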
Here is the main function:
def main():
    past_thetas = []
    past_costs = []
    regressor = LinearRegression()
    past_thetas, past_costs = regressor.gradientDescent(x=data['x'], y=data['y'])
    print(past_costs[-1])
    plt.figure(figsize=(10,10))
    sns.scatterplot(data=data, x='x', y='y')
    x = np.linspace(0, 100, 5)
    y = past_thetas[-1][0] + (past_thetas[-1][1]*x)
    plt.plot(x, y, 'b')
    sns.scatterplot(data=data, x='x', y='y')
    plt.figure(figsize=(5,5))
    plt.plot(past_costs)

if __name__ == "__main__":
    main()
The problem:
If I run this with Learning Rate = 0.0001 & Iterations = 100, I get the following results:
However, if I keep the learning rate constant but change the iterations to 1000 or 10000, the line keeps drifting away from the points, and while doing so it computes a cost lower than that of an obviously more accurate fit. Here is the result with LR = 0.0001 & iterations = 10000:
Here are the cost function plots in both cases as well:
How do I fix this? As per my understanding, more iterations should only take the solution further towards the local minimum; it should not overshoot it. The code gives similar results with an even lower learning rate, so I am fairly sure that the learning rate is not the issue (although please correct me if I am wrong about that).
There are two things happening there:

1. The two lines marked with #Not sure if this is needed or not (the standardisation of x and y) can be removed.
2. The dataset contains a row with x=3500, y=nan. Due to the way you were cleaning the nans (by setting them to 0), you would end up with a point at (3500, 0).

Because of (1), you are fitting a line with a slope that may be a lot smaller than your data (in your test case it would work because the slope is close to 1, and the intercept is also small). Because of (2), your fit is being very skewed by that outlier. The point did not appear on your plots because points with nans are usually not displayed. Note also that the cost you print is computed on the standardised data while the line is drawn against the raw data, so a lower cost does not have to correspond to a better-looking line on the original scale; with only 100 iterations the weights have barely moved from their initial value of [1, 1], which happens to sit close to the raw-scale fit.
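To see that row for yourself, here is a minimal sketch (assuming the same train.csv used above):

import numpy as np
import pandas as pd

data = pd.read_csv("train.csv")
bad_rows = data[data['y'].isna()]               # the offending row: x = 3500, y missing
print(bad_rows)
print(np.nan_to_num(bad_rows['y'].to_numpy()))  # np.nan_to_num silently turns that NaN into 0.0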
Additionally, I suggest some changes to your code: iterate until the cost stops changing (up to a maximum number of iterations), and throw an exception if a nan is present in any of the inputs.
# No need for this to be a class
def linearRegression(x, y, threshold=1e-6, learningrate=0.00001, max_iter=1_000_000):
    # Check for nans and explode if they are here
    if np.any(np.isnan(x)) or np.any(np.isnan(y)):
        raise Exception('There are nans in the input data, please clean it')
    # adding col of 1s to accommodate bias (theta0)
    x = np.c_[np.ones(x.shape[0]), x]  # m x n matrix
    m = x.shape[0]  # m = number of samples
    n = x.shape[1]  # n = number of features
    # weights
    theta = np.ones(n)  # n x 1 vector
    past_thetas = [theta]
    past_costs = []
    i = 0
    while i < max_iter:
        i += 1
        hypothesis = np.dot(x, theta)  # m x 1 vector
        error = np.subtract(hypothesis, y)  # m x 1
        cost = 1/(2 * m) * np.dot(error, error.T)
        delta = (1/m) * np.dot(x.T, error)
        theta = theta - learningrate*delta
        if past_costs and (abs(cost - past_costs[-1]) < threshold):
            past_costs.append(cost)
            break
        past_costs.append(cost)  # Not a great repetition, but it works
        past_thetas.append(theta)
    if i == max_iter:
        print('Hit max iterations without convergence')
    print("Weights:", theta)
    return past_thetas, past_costs
def main():
    data = pd.read_csv("train.csv")
    clean_data = data.dropna()  # Remove the nans from the regression
    past_thetas, past_costs = linearRegression(x=clean_data['x'], y=clean_data['y'])
    print(past_costs[-1])
    print(f'Converged in {len(past_costs)} iterations')

    plt.figure(figsize=(10,10)); plt.grid()
    sns.scatterplot(data=clean_data, x='x', y='y')
    x = clean_data['x']
    y = past_thetas[-1][0] + (past_thetas[-1][1]*x)
    plt.plot(x, y, 'b')

    plt.figure(figsize=(10,10)); plt.grid()
    sns.scatterplot(data=data, x='x', y='y')
    x = data['x']  # Now you use your interpolation to find the data that should be there
    y = past_thetas[-1][0] + (past_thetas[-1][1]*x)
    plt.plot(x, y, 'b')
    sns.scatterplot(data=data, x='x', y='y')

    plt.figure(figsize=(5,5)); plt.grid()
    plt.plot(past_costs)
    plt.show()
This converges in 175 iterations with a reasonably low cost:
Weights: [0.99927641 0.98412022]
4.087396131056722
Converged in 175 iterations
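As a quick sanity check (a sketch, not part of the run above, assuming the clean_data frame from main()), the weights can be compared against NumPy's closed-form least-squares fit:

slope, intercept = np.polyfit(clean_data['x'], clean_data['y'], deg=1)
print("Closed-form fit:", intercept, slope)  # should land close to the Weights printed above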