Tags: python, machine-learning, linear-regression, gradient-descent

Batch Gradient Descent with Python not converging


Here's the Jupyter Notebook I used for this practice: https://drive.google.com/file/d/18-OXyvXSit5x0ftiW9bhcqJrO_SE22_S/view?usp=sharing

I was practicing simple Linear Regression with this data set, and here are my parameters:

sat = np.array(data['SAT'])
gpa = np.array(data['GPA'])
theta_0 = 0.01
theta_1 = 0.01
alpha = 0.003
cost = 0
m = len(gpa)

I tried to optimize the cost function calculation by turning it into a matrix operation and performing element-wise operations. This is the formula I came up with:

Cost function optimization (image)

Cost function

def calculateCost(matrix_x,matrix_y,m):
    global theta_0,theta_1
    cost = (1 / (2 * m)) * ((theta_0 + (theta_1 * matrix_x) - matrix_y) ** 2).sum()
    return cost

I also tried to do the same for the gradient descent.

Gradient Descent

def gradDescent(alpha,matrix_x,matrix_y):
    global theta_0,theta_1,m,cost
    cost = calculateCost(sat,gpa,m)
    while cost > 1:
        temp_0 = theta_0 - alpha * (1 / m) * (theta_0 + theta_1 * matrix_x - matrix_y).sum()
        temp_1 = theta_1 - alpha * (1 / m) * (matrix_x.transpose() * (theta_0 + theta_1 * matrix_x - matrix_y)).sum()
        theta_0 = temp_0
        theta_1 = temp_1

I am not entirely sure whether both implementations are correct. The implementation returned a cost of 114.89379821428574, and somehow this is what the "descent" looked like when I graphed the costs:

Gradient descent graph (image)

Please let me know whether I have implemented the cost function and gradient descent correctly, and provide an explanation if possible, as I am still a beginner in multivariable calculus. Thank you.


Solution

  • There are many issues with that code.

    First, the two main issues that are behind the bugs:

    1) The line

    temp_1 = theta_1 - alpha * (1 / m) * (matrix_x.transpose() * (theta_0 + theta_1 * matrix_x - matrix_y)).sum()
    

    specifically the matrix multiplication matrix_x.transpose() * (theta_0 + ...). The * operator does element-wise multiplication, so broadcasting the transposed vector against the residual produces a result of size 20x20, where you expect a gradient of size 1x1 (as you update a single real variable theta_1).
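    To see the shape problem concretely, here is a small illustrative snippet (it assumes column vectors of shape (20, 1), as in the corrected code further down; with the 1-D arrays from np.array(data['SAT']) the exact shapes differ, but the gradient is still not the scalar you want):

```python
import numpy as np

x = np.random.rand(20, 1)          # column vector, shape (20, 1)
residual = np.random.rand(20, 1)   # stand-in for (theta_0 + theta_1 * x - y)

# Element-wise `*` broadcasts the (1, 20) transpose against (20, 1),
# producing a full 20x20 matrix instead of a scalar:
wrong = x.transpose() * residual
print(wrong.shape)   # (20, 20)

# .dot() performs actual matrix multiplication and contracts the shared axis:
right = x.transpose().dot(residual)
print(right.shape)   # (1, 1) -- a single number, as the theta_1 update expects
```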

    2) The while cost>1: condition in your gradient computation. You never update the cost inside the loop, so the condition never changes: the loop either never runs or never terminates.

    Here is a version of your code that works:

    import numpy as np
    import matplotlib.pyplot as plt
    
    sat=np.random.rand(40,1)
    rand_a=np.random.randint(500)
    rand_b=np.random.randint(400)
    gpa=rand_a*sat+rand_b
    theta_0 = 0.01
    theta_1 = 0.01
    alpha = 0.1
    cost = 0
    m = len(gpa)
    
    def calculateCost(matrix_x,matrix_y,m):
        global theta_0,theta_1
        cost = (1 / (2 * m)) * ((theta_0 + (theta_1 * matrix_x) - matrix_y) ** 2).sum()
        return cost
    
    def gradDescent(alpha,matrix_x,matrix_y,num_iter=10000,eps=0.5):
        global theta_0,theta_1,m,cost
        cost = calculateCost(matrix_x,matrix_y,m)
        cost_hist=[cost]
        for i in range(num_iter):
            theta_0 -= alpha * (1 / m) * (theta_0 + theta_1 * matrix_x - matrix_y).sum()
            theta_1 -= alpha * (1 / m) * (matrix_x.transpose().dot(theta_0 + theta_1 * matrix_x - matrix_y)).sum()
            cost = calculateCost(matrix_x,matrix_y,m)
            cost_hist.append(cost)
            if cost<eps:
                break
        return cost_hist
    
    if __name__=="__main__":
    
        print("init_cost==",cost)
        cost_hist=gradDescent(alpha,sat,gpa)
        print("final_cost,num_iters",cost,len(cost_hist))
        print(rand_b,theta_0,rand_a,theta_1)
        plt.plot(cost_hist,linewidth=5,color="r");plt.show()
    

    Finally, the coding style itself, while not responsible for the bugs, is definitely an issue here. Generally, global variables are bad practice: they lead to bug-prone, unmaintainable code. It is better to store parameters in small data structures and pass them around to functions. In your case, you can put the initial parameters in a list, pass them to your gradient descent function, and return the optimized ones at the end.
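    As a sketch of that suggestion (the names cost and fit, and the synthetic data, are my own, not from the original code): pass the parameters in as a tuple and return the optimized values, with no globals anywhere.

```python
import numpy as np

def cost(theta, x, y):
    """Halved mean squared error for parameters theta = (theta_0, theta_1)."""
    theta_0, theta_1 = theta
    m = len(y)
    return (1 / (2 * m)) * ((theta_0 + theta_1 * x - y) ** 2).sum()

def fit(theta, x, y, alpha=0.1, num_iter=10000, eps=0.5):
    """Gradient descent that takes and returns parameters -- no globals."""
    theta_0, theta_1 = theta
    m = len(y)
    history = [cost((theta_0, theta_1), x, y)]
    for _ in range(num_iter):
        residual = theta_0 + theta_1 * x - y
        theta_0 -= alpha * residual.sum() / m
        theta_1 -= alpha * x.transpose().dot(residual).sum() / m
        history.append(cost((theta_0, theta_1), x, y))
        if history[-1] < eps:
            break
    return (theta_0, theta_1), history

# Synthetic column-vector data, in the spirit of the answer's example
x = np.random.rand(40, 1)
y = 3.0 * x + 2.0
(theta_0, theta_1), hist = fit((0.01, 0.01), x, y)
```

    This keeps each function testable on its own, and the caller decides what to do with the optimized parameters instead of having them mutated invisibly.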