I have been trying to implement a linear regression model with gradient descent from scratch. I have reached a point where increasing the number of iterations my code runs for leads to an obviously worse fit that nevertheless reports a lower cost. Kindly help me.
Here is the data-set I am using to train the model: https://www.kaggle.com/datasets/andonians/random-linear-regression (Train.csv)
Importing Python Libraries:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #data visualisation (seaborn)
import matplotlib.pyplot as plt
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
I am importing the data as follows:
data = pd.read_csv("../input/random-linear-regression/train.csv")
data.head()
Here is the Linear Regression class:
class LinearRegression():
    def __init__(self, iterations=100, learningrate=0.0001):
        self.iterations = iterations
        self.learningrate = learningrate

    def gradientDescent(self, x, y, threshold=1e-9):
        x = (x - x.mean()) / x.std()  #Not sure if this is needed or not
        y = (y - y.mean()) / y.std()  #Not sure if needed or not
        #adding col of 1s to accommodate bias (theta0)
        x = np.c_[np.ones(x.shape[0]), x]  # m x n matrix
        #Cleaning data to get rid of NAN and replacing with 0
        y = np.nan_to_num(y)
        #m = number of samples
        #n = number of features
        m = x.shape[0]
        n = x.shape[1]
        #weights
        theta = np.ones(n)  # n x 1 vector
        past_thetas = [0]
        past_costs = []
        for i in range(self.iterations):
            #hypothesis
            h = np.dot(x, theta)  # m x 1 vector
            error = np.subtract(h, y)  # m x 1
            errorsq = np.square(error)
            cost = 1/(2 * m) * np.sum(errorsq)
            past_costs.append(cost)
            delta = (1/m) * np.dot(x.T, error)
            theta = theta - (self.learningrate * delta)
            past_thetas.append(theta)
        print("Weights:", theta)
        return past_thetas, past_costs
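For reference, the quantities the loop is meant to compute are the usual least-squares cost and batch gradient descent update,

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)\top}\theta - y^{(i)}\right)^{2}, \qquad \theta \leftarrow \theta - \alpha \cdot \frac{1}{m} X^{\top}(X\theta - y),$$

where $\alpha$ is the learning rate and $X$ is the m x n matrix with the column of ones prepended.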
Here is the main function:
def main():
    past_thetas = []
    past_costs = []
    regressor = LinearRegression()
    past_thetas, past_costs = regressor.gradientDescent(x=data['x'], y=data['y'])
    print(past_costs[-1])
    plt.figure(figsize=(10,10))
    sns.scatterplot(data=data, x='x', y='y')
    x = np.linspace(0, 100, 5)
    y = past_thetas[-1][0] + (past_thetas[-1][1]*x)
    plt.plot(x, y, 'b')
    sns.scatterplot(data=data, x='x', y='y')
    plt.figure(figsize=(5,5))
    plt.plot(past_costs)

if __name__ == "__main__":
    main()
The problem:
If I run this with Learning Rate = 0.0001 & Iterations = 100, I get the following results:
However, if I keep the learning rate constant but change the iterations to 1000 or 10000, the line keeps drifting away from the points, and while doing so it computes a cost lower than that of an obviously more accurate fit. Here is the result with LR = 0.0001 & iterations = 10000:
Here are the cost function plots in both cases as well:
How do I fix this? As per my understanding, more iterations should only take the solution further towards the local minimum; it should not overshoot it. The code gives similar results with an even lower learning rate, so I am fairly sure that the learning rate is not the issue (although please correct me if I am wrong about that).
There are two things happening there:

1. The two lines marked with #Not sure if this is needed or not (the standardisation of x and y) can be removed.
2. The dataset contains a row with x=3500, y=nan. Due to the way you were cleaning the nans (by setting them to 0), you would end up with a point at (3500, 0).

Because of (1), you are fitting a line with a slope that may be a lot smaller than your data (in your test case it would work because the slope is close to 1, and the intercept is also small). Because of (2), your fit is being very skewed by that outlier. The point did not appear on your plots because points with nans are usually not displayed. Note also that the cost you print is computed on the standardised data while the line is drawn against the raw data, so a lower cost does not have to correspond to a better-looking line on the original scale; with only 100 iterations the weights have barely moved from their initial value of [1, 1], which happens to sit close to the raw-scale fit.
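To see that row for yourself, here is a minimal sketch (assuming the same train.csv used above):

import numpy as np
import pandas as pd

data = pd.read_csv("train.csv")
bad_rows = data[data['y'].isna()]               # the offending row: x = 3500, y missing
print(bad_rows)
print(np.nan_to_num(bad_rows['y'].to_numpy()))  # np.nan_to_num silently turns that NaN into 0.0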
Additionally, I suggest some changes to your code: iterate until the cost stops changing (up to a maximum number of iterations), and throw an exception if a nan is present in any of the inputs.
# No need for this to be a class
def linearRegression(x, y, threshold=1e-6, learningrate=0.00001, max_iter=1_000_000):
    # Check for nans and explode if they are here
    if np.any(np.isnan(x)) or np.any(np.isnan(y)):
        raise Exception('There are nans in the input data, please clean it')
    # adding col of 1s to accommodate bias (theta0)
    x = np.c_[np.ones(x.shape[0]), x]  # m x n matrix
    m = x.shape[0]  # m = number of samples
    n = x.shape[1]  # n = number of features
    # weights
    theta = np.ones(n)  # n x 1 vector
    past_thetas = [theta]
    past_costs = []
    i = 0
    while i < max_iter:
        i += 1
        hypothesis = np.dot(x, theta)  # m x 1 vector
        error = np.subtract(hypothesis, y)  # m x 1
        cost = 1/(2 * m) * np.dot(error, error.T)
        delta = (1/m) * np.dot(x.T, error)
        theta = theta - learningrate*delta
        if past_costs and (abs(cost - past_costs[-1]) < threshold):
            past_costs.append(cost)
            break
        past_costs.append(cost)  # Not a great repetition, but it works
        past_thetas.append(theta)
    if i == max_iter:
        print('Hit max iterations without convergence')
    print("Weights:", theta)
    return past_thetas, past_costs
def main():
    data = pd.read_csv("train.csv")
    clean_data = data.dropna()  # Remove the nans from the regression
    past_thetas, past_costs = linearRegression(x=clean_data['x'], y=clean_data['y'])
    print(past_costs[-1])
    print(f'Converged in {len(past_costs)} iterations')

    plt.figure(figsize=(10,10)); plt.grid()
    sns.scatterplot(data=clean_data, x='x', y='y')
    x = clean_data['x']
    y = past_thetas[-1][0] + (past_thetas[-1][1]*x)
    plt.plot(x, y, 'b')

    plt.figure(figsize=(10,10)); plt.grid()
    sns.scatterplot(data=data, x='x', y='y')
    x = data['x']  # Now you use your interpolation to find the data that should be there
    y = past_thetas[-1][0] + (past_thetas[-1][1]*x)
    plt.plot(x, y, 'b')
    sns.scatterplot(data=data, x='x', y='y')

    plt.figure(figsize=(5,5)); plt.grid()
    plt.plot(past_costs)
    plt.show()
This converges in 175 iterations with a reasonably low cost:
Weights: [0.99927641 0.98412022]
4.087396131056722
Converged in 175 iterations
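As a quick sanity check (a sketch, not part of the run above, assuming the clean_data frame from main()), the weights can be compared against NumPy's closed-form least-squares fit:

slope, intercept = np.polyfit(clean_data['x'], clean_data['y'], deg=1)
print("Closed-form fit:", intercept, slope)  # should land close to the Weights printed above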