python, linear-regression, gradient, gradient-descent

Can't figure out gradient descent linear regression


I'm currently working on a gradient descent project.

I chose NBA stats as my data, so I downloaded 3-pointers-made data and points data from Basketball Reference, and I have successfully plotted a scatter plot. However, the result does not seem right.

My scatter plot trends up and to the right (more threes made generally means more points scored, so that makes sense).

But my gradient descent line slopes up and to the left, and I can't figure out what's wrong.

import pandas as pd
import numpy as np
from sklearn import linear_model
from matplotlib import pyplot as plt


data = pd.read_csv('C:/Users/jeehw/Documents/FG3M_PTS_2021.csv')


X = data.iloc[:,1]
Y = data.iloc[:,2]

plt.figure(figsize=(8,6))
plt.xlabel('FG3M')                                  
plt.ylabel('PTS')
plt.scatter(X,Y)
plt.show()

m = 0
c = 0

L = 0.001
epochs = 200

n = float(len(X))

for i in range(len(X)):
    Y_pred = m*X + c
    m_Grad = (1/n) * sum(X * (Y_pred - Y))
    c_Grad = (1/n) * sum(Y_pred - Y)

    m = m - L * m_Grad
    c = c - L * c_Grad

Y_pred = m*X + c

plt.scatter(X, Y)
plt.scatter(X, Y_pred)
plt.show()

Solution

  • A few things in this code don't really make sense. Are you trying to do the regression from scratch? Because you import scikit-learn but never apply it. You can refer to the scikit-learn documentation on linear regression for how to use it. I would also consider playing around with other algorithms too.

    I believe this is what you are trying to do here:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn import linear_model
    from sklearn.metrics import mean_squared_error, r2_score
    from matplotlib import pyplot as plt
    
    
    #data = pd.read_csv('C:/Users/jeehw/Documents/FG3M_PTS_2021.csv')
    # Scrape the season-totals table from basketball-reference
    raw_data = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2021_totals.html')[0]
    # Drop the repeated header rows embedded in the table
    raw_data = raw_data[raw_data['Rk'].ne('Rk')]
    
    # Keep only the columns we need; .copy() avoids a SettingWithCopyWarning below
    data = raw_data[['Player','3P','PTS']].copy()
    data[['3P','PTS']] = data[['3P','PTS']].astype(int)
    
    X = data['3P'].values
    y = data['PTS'].values
    
    plt.figure(figsize=(8,6))
    plt.xlabel('FG3M')                                  
    plt.ylabel('PTS')
    plt.scatter(X,y)
    
    plt.xticks(np.arange(min(X), max(X)+1, 20))
    plt.yticks(np.arange(min(y), max(y)+1, 100))
    plt.show()
    
    
    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    
    # Create linear regression object
    regr = linear_model.LinearRegression()
    
    # Train the model using the training sets
    regr.fit(X_train.reshape(-1, 1), y_train)
    
    # Make predictions using the testing set
    y_pred = regr.predict(X_test.reshape(-1, 1))
    
    
    
    # The coefficients
    print('Coefficients: \n', regr.coef_)
    # The mean squared error
    print('Mean squared error: %.2f'
          % mean_squared_error(y_test, y_pred))
    # The coefficient of determination: 1 is perfect prediction
    print('Coefficient of determination: %.2f'
          % r2_score(y_test, y_pred))
    
    # Plot outputs
    plt.scatter(X_test, y_test,  color='black')
    plt.plot(X_test, y_pred, color='red', linewidth=3)
    
    plt.xticks(np.arange(min(X_test), max(X_test)+1, 20))
    plt.yticks(np.arange(min(y_pred), max(y_pred)+1, 100))
    
    plt.xlabel('FG3M')                                  
    plt.ylabel('PTS')
    
    plt.show()
    

    [Scatter plot of FG3M vs PTS for the scraped data]

    [Test-set scatter with the fitted regression line]

    There is some noise in there. You have a lot of players who score a lot of points but never shoot a three-pointer, let alone make one. So I would consider doing a good amount of data cleaning first (maybe take only players who've had at least 50 three-point attempts? Or get rid of centers? Also, if a player changes teams, they may appear in the dataset a few times with their totals for each team, so there is some redundancy in there... but I'm not going to take the time to clean it fully since it's beyond the scope of the question); a rough sketch of that cleaning is shown below. I would also test out other machine learning algorithms. But the code above should at least get you going and playing around. Have fun!
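    As a rough sketch of that cleaning (this assumes the scraped table has a '3PA' column for three-point attempts and that traded players get a combined 'TOT' row listed before their per-team rows; neither is verified here, so adjust to the actual table):

    # Hypothetical cleaning sketch; the '3PA' column and the 'TOT'-row-first ordering are assumptions
    raw_data = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2021_totals.html')[0]
    raw_data = raw_data[raw_data['Rk'].ne('Rk')]
    
    # Traded players appear once per team plus a combined 'TOT' row; keeping only the
    # first row per player is intended to keep that combined row and drop the duplicates
    raw_data = raw_data.drop_duplicates(subset='Player', keep='first')
    
    data = raw_data[['Player', '3P', '3PA', 'PTS']].copy()
    data[['3P', '3PA', 'PTS']] = data[['3P', '3PA', 'PTS']].astype(int)
    
    # Keep only players with at least 50 three-point attempts, per the suggestion above
    data = data[data['3PA'] >= 50]
    
    X = data['3P'].values
    y = data['PTS'].values

    From there, the same train/test split and LinearRegression fit from the code above should work unchanged.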