I'm currently working on a gradient descent project.
I chose NBA stats as my data, so I downloaded three-pointers-made (FG3M) and points (PTS) data from Basketball Reference, and I have successfully plotted a scatter plot. However, the result does not seem right.
My scatter plot slopes up and to the right (which makes sense, since making more three-pointers generally means scoring more points),
but my gradient descent line slopes up and to the left, and I don't know what's wrong.
import pandas as pd
import numpy as np
from sklearn import linear_model
from matplotlib import pyplot as plt
data = pd.read_csv('C:/Users/jeehw/Documents/FG3M_PTS_2021.csv')
X = data.iloc[:,1]
Y = data.iloc[:,2]
plt.figure(figsize=(8,6))
plt.xlabel('FG3M')
plt.ylabel('PTS')
plt.scatter(X,Y)
plt.show()
m = 0
c = 0
L = 0.001
epochs = 200
n = float(len(X))
for i in range(len(X)):
    Y_pred = m*X + c
    m_Grad = (1/n) * sum(X * (Y_pred - Y))
    c_Grad = (1/n) * sum(Y_pred - Y)
    m = m - L * m_Grad
    c = c - L * c_Grad
Y_pred = m*X + c
plt.scatter(X, Y)
plt.scatter(X, Y_pred)
plt.show()
A few things in this code don't really make sense. Are you trying to do the regression from scratch? You import scikit-learn but never actually use it. You also define epochs = 200 but then loop over range(len(X)) instead, and with raw season totals a learning rate of 0.001 is large enough that the updates likely overshoot and oscillate, which would explain why the fitted line ends up sloping the wrong way. If you want to use scikit-learn instead, its documentation has a worked linear regression example you can follow, and I would also consider playing around with other algorithms.
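If you do want to keep the from-scratch version, here is a minimal sketch of how I would adjust it. It assumes X and Y are the same pandas Series you read from your CSV; the smaller learning rate is just a value that behaves on data of this scale, and normalizing the features first would be the more robust fix.

import numpy as np
from matplotlib import pyplot as plt

x = X.to_numpy(dtype=float)
y = Y.to_numpy(dtype=float)

m, c = 0.0, 0.0
L = 1e-5            # far smaller step than 0.001; season totals make the gradients large
epochs = 200
n = float(len(x))

for _ in range(epochs):                            # iterate over epochs, not over the rows of X
    y_pred = m * x + c
    m_grad = (1 / n) * np.sum(x * (y_pred - y))    # gradient of 0.5*MSE with respect to the slope
    c_grad = (1 / n) * np.sum(y_pred - y)          # gradient of 0.5*MSE with respect to the intercept
    m -= L * m_grad
    c -= L * c_grad

plt.scatter(x, y)
plt.plot(x, m * x + c, color='red')                # fitted line should now slope up and to the right
plt.show()

The intercept converges very slowly at this step size, but the slope settles quickly, which is enough to see the line point the right way. Scaling X and Y to zero mean and unit variance first would let you use a much larger learning rate.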
If you would rather let scikit-learn handle the regression for you, I believe this is what you are trying to do here:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from matplotlib import pyplot as plt
#data = pd.read_csv('C:/Users/jeehw/Documents/FG3M_PTS_2021.csv')
# Pull the season totals straight from basketball-reference instead of a local CSV
raw_data = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2021_totals.html')[0]
# The page repeats its header row throughout the table; drop those rows
raw_data = raw_data[raw_data['Rk'].ne('Rk')]
# Keep only the columns we need; copy so pandas doesn't warn about chained assignment
data = raw_data[['Player','3P','PTS']].copy()
data[['3P','PTS']] = data[['3P','PTS']].astype(int)
X = data['3P'].values
y = data['PTS'].values
plt.figure(figsize=(8,6))
plt.xlabel('FG3M')
plt.ylabel('PTS')
plt.scatter(X,y)
plt.xticks(np.arange(min(X), max(X)+1, 20))
plt.yticks(np.arange(min(y), max(y)+1, 100))
plt.show()
# Split data into test and Train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(X_train.reshape(-1, 1), y_train)
# Make predictions using the testing set
y_pred = regr.predict(X_test.reshape(-1, 1))
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))
# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='red', linewidth=3)
plt.xticks(np.arange(min(X_test), max(X_test)+1, 20))
plt.yticks(np.arange(min(y_pred), max(y_pred)+1, 100))
plt.xlabel('FG3M')
plt.ylabel('PTS')
plt.show()
There is some noise in there. You have a lot of players who score plenty of points but never attempt a three-pointer, let alone make one. So I would consider doing a good amount of data cleaning first (maybe keep only players with at least 50 three-point attempts, or drop centers). Also, if a player changed teams mid-season, he may appear in the dataset several times with separate totals for each team, so there is some redundancy in there. I'm not going to clean it all the way here since that's beyond the scope of the question, but a rough sketch of where I would start is below. I would also test out other machine learning algorithms. But the code above should at least get you going and playing around. Have fun!
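For example, a rough cleaning pass might look like the following. This is only a sketch: it assumes the totals table exposes 'Pos', 'Tm', and '3PA' columns and gives traded players a combined 'TOT' row, which is how the basketball-reference page was laid out last time I looked; adjust the names and thresholds if your table differs.

# Continue from the raw_data frame loaded above
cleaned = raw_data[['Player', 'Pos', 'Tm', '3P', '3PA', 'PTS']].copy()
cleaned[['3P', '3PA', 'PTS']] = cleaned[['3P', '3PA', 'PTS']].astype(int)

# Players who changed teams have one row per team plus a combined 'TOT' row;
# keep only the 'TOT' row for them and the single row for everyone else
multi_team = cleaned.duplicated(subset='Player', keep=False)
cleaned = cleaned[~multi_team | cleaned['Tm'].eq('TOT')]

# Drop low-volume shooters (and centers, if you want to go that far)
cleaned = cleaned[cleaned['3PA'] >= 50]
cleaned = cleaned[~cleaned['Pos'].str.contains('C')]

X = cleaned['3P'].values
y = cleaned['PTS'].values

After that, you can re-run the train/test split and fit on the cleaned X and y exactly as above.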