Tags: python, machine-learning, linear-regression, gradient-descent

Batch Gradient Descent with Python not converging


Here's the Jupyter Notebook I used for this practice: https://drive.google.com/file/d/18-OXyvXSit5x0ftiW9bhcqJrO_SE22_S/view?usp=sharing

I was practicing simple Linear Regression with this data set, and here are my parameters:

sat = np.array(data['SAT'])
gpa = np.array(data['GPA'])
theta_0 = 0.01
theta_1 = 0.01
alpha = 0.003
cost = 0
m = len(gpa)

I tried to optimize the cost function calculation by turning it into a matrix operation and performing element-wise operations. This is the formula I came up with:

Cost function optimization (image)

Cost function

def calculateCost(matrix_x,matrix_y,m):
    global theta_0,theta_1
    cost = (1 / (2 * m)) * ((theta_0 + (theta_1 * matrix_x) - matrix_y) ** 2).sum()
    return cost

I also tried to do the same for the gradient descent.

Gradient Descent

def gradDescent(alpha,matrix_x,matrix_y):
    global theta_0,theta_1,m,cost
    cost = calculateCost(sat,gpa,m)
    while cost > 1:
        temp_0 = theta_0 - alpha * (1 / m) * (theta_0 + theta_1 * matrix_x - matrix_y).sum()
        temp_1 = theta_1 - alpha * (1 / m) * (matrix_x.transpose() * (theta_0 + theta_1 * matrix_x - matrix_y)).sum()
        theta_0 = temp_0
        theta_1 = temp_1

I am not entirely sure whether both implementations are correct. The implementation returned a cost of 114.89379821428574, and somehow this is what the "descent" looked like when I graphed the costs:

Gradient descent graph (image)

Please let me know whether I have implemented the cost function and gradient descent correctly, and provide an explanation if possible, as I am still a beginner in multivariable calculus. Thank you.


Solution

  • There are many issues with that code.

    First, the two main issues that are behind the bugs:

    1) The line

    temp_1 = theta_1 - alpha * (1 / m) * (matrix_x.transpose() * (theta_0 + theta_1 * matrix_x - matrix_y)).sum()
    

    specifically the matrix multiplication matrix_x.transpose() * (theta_0 + ...). The * operator does element-wise multiplication, so broadcasting the transposed vector against the residual produces a result of size 20x20, where you expect a gradient of size 1x1 (as you update a single real variable theta_1).
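    To see the shape problem concretely, here is a small illustrative snippet (it assumes column vectors of shape (20, 1), as in the corrected code further down; with the 1-D arrays from np.array(data['SAT']) the exact shapes differ, but the gradient is still not the scalar you want):

```python
import numpy as np

x = np.random.rand(20, 1)          # column vector, shape (20, 1)
residual = np.random.rand(20, 1)   # stand-in for (theta_0 + theta_1 * x - y)

# Element-wise `*` broadcasts the (1, 20) transpose against (20, 1),
# producing a full 20x20 matrix instead of a scalar:
wrong = x.transpose() * residual
print(wrong.shape)   # (20, 20)

# .dot() performs actual matrix multiplication and contracts the shared axis:
right = x.transpose().dot(residual)
print(right.shape)   # (1, 1) -- a single number, as the theta_1 update expects
```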

    2) The while cost>1: condition in your gradient computation. You never update the cost inside the loop, so the condition never changes: the loop either never runs or never terminates.

    Here is a version of your code that works:

    import numpy as np
    import matplotlib.pyplot as plt
    
    sat=np.random.rand(40,1)
    rand_a=np.random.randint(500)
    rand_b=np.random.randint(400)
    gpa=rand_a*sat+rand_b
    theta_0 = 0.01
    theta_1 = 0.01
    alpha = 0.1
    cost = 0
    m = len(gpa)
    
    def calculateCost(matrix_x,matrix_y,m):
        global theta_0,theta_1
        cost = (1 / (2 * m)) * ((theta_0 + (theta_1 * matrix_x) - matrix_y) ** 2).sum()
        return cost
    
    def gradDescent(alpha,matrix_x,matrix_y,num_iter=10000,eps=0.5):
        global theta_0,theta_1,m,cost
        cost = calculateCost(matrix_x,matrix_y,m)
        cost_hist=[cost]
        for i in range(num_iter):
            theta_0 -= alpha * (1 / m) * (theta_0 + theta_1 * matrix_x - matrix_y).sum()
            theta_1 -= alpha * (1 / m) * (matrix_x.transpose().dot(theta_0 + theta_1 * matrix_x - matrix_y)).sum()
            cost = calculateCost(matrix_x,matrix_y,m)
            cost_hist.append(cost)
            if cost<eps:
                break
        return cost_hist
    
    if __name__=="__main__":
    
        print("init_cost==",cost)
        cost_hist=gradDescent(alpha,sat,gpa)
        print("final_cost,num_iters",cost,len(cost_hist))
        print(rand_b,theta_0,rand_a,theta_1)
        plt.plot(cost_hist,linewidth=5,color="r");plt.show()
    

    Finally, the coding style itself, while not responsible for the bugs, is definitely an issue here. Generally, global variables are bad practice: they lead to bug-prone, unmaintainable code. It is better to store parameters in small data structures and pass them around to functions. In your case, you can put the initial parameters in a list, pass them to your gradient descent function, and return the optimized ones at the end.
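    As a sketch of that suggestion (the names cost and fit, and the synthetic data, are my own, not from the original code): pass the parameters in as a tuple and return the optimized values, with no globals anywhere.

```python
import numpy as np

def cost(theta, x, y):
    """Halved mean squared error for parameters theta = (theta_0, theta_1)."""
    theta_0, theta_1 = theta
    m = len(y)
    return (1 / (2 * m)) * ((theta_0 + theta_1 * x - y) ** 2).sum()

def fit(theta, x, y, alpha=0.1, num_iter=10000, eps=0.5):
    """Gradient descent that takes and returns parameters -- no globals."""
    theta_0, theta_1 = theta
    m = len(y)
    history = [cost((theta_0, theta_1), x, y)]
    for _ in range(num_iter):
        residual = theta_0 + theta_1 * x - y
        theta_0 -= alpha * residual.sum() / m
        theta_1 -= alpha * x.transpose().dot(residual).sum() / m
        history.append(cost((theta_0, theta_1), x, y))
        if history[-1] < eps:
            break
    return (theta_0, theta_1), history

# Synthetic column-vector data, in the spirit of the answer's example
x = np.random.rand(40, 1)
y = 3.0 * x + 2.0
(theta_0, theta_1), hist = fit((0.01, 0.01), x, y)
```

    This keeps each function testable on its own, and the caller decides what to do with the optimized parameters instead of having them mutated invisibly.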