I have a question about how to update w and b in linear regression.
After training for more iterations, the resulting w and b still don't seem to fit the training set. I'm not sure what I did wrong in the code.
Here is my code, broken into parts:
1. Read and normalize the data
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('ML/cleaned_data_utf8.csv')
df.head()
df.dtypes
rmv_list = ['$',',']
for i in rmv_list:
    df['budget']=df['budget'].str.replace(i,'',regex=False)
    df['movie_gross_domestic']=df['movie_gross_domestic'].str.replace(i,'',regex=False)
    df['movie_gross_worldwide']=df['movie_gross_worldwide'].str.replace(i,'',regex=False)
df['budget']=df['budget'].astype(float)/(10**9)
df['movie_gross_domestic']=df['movie_gross_domestic'].astype(float)/(10**9)
df['movie_gross_worldwide']=df['movie_gross_worldwide'].astype(float)/(10**9)
df.head()
df.dtypes
2. Show example dataframe
sum_gross = df['movie_gross_domestic'] + df['movie_gross_worldwide']
df['Total_gross'] = sum_gross
df.head()
3. Define the gradient descent step and the loop that updates w and b
def gradient_descent(w,b,list_x,list_y,alpha,current_index):
    diff_w = 0
    diff_b = 0
    training_size = len(list_x)
    # Accumulate the gradient of the squared error over the whole training set
    for i in range(training_size):
        f_of_wb = w * list_x[i]+ b
        diff_w_i = (f_of_wb - list_y[i]) * list_x[i]
        diff_b_i = (f_of_wb - list_y[i])
        diff_w += diff_w_i
        diff_b += diff_b_i
    # One update step using the averaged gradient
    w = w-(alpha*diff_w)*(1/training_size)
    b = b-(alpha*diff_b)*(1/training_size)
    # Mean squared error after the update, printed every 1000 steps
    sigma = 0
    for i in range(training_size):
        sigma += (list_y[i]-(w*list_x[i]+b))**2
    loss = sigma/training_size
    if current_index%1000 == 0:
        print(loss)
    return (w,b)

def update_w_b(num_loop,w,b,alpha,list_x,list_y):
    current_index = 0
    for i in range(num_loop):
        (w,b) = gradient_descent(w,b,list_x,list_y,alpha,current_index)
        current_index += 1
    return (w,b)
4. Train the prediction line from initial values
w,b = update_w_b(num_loop = 10000 ,w = 2 ,b = 0 ,alpha=1.5,list_x = list(df['budget']),list_y = list(df['Total_gross']))
5. Plot the prediction line against the training data (the line looks badly underfit)
x_axis = df['Total_gross']
y_axis = df['budget']
plt.xlabel("Total_gross($)")
plt.ylabel("budget($)")
plt.title("Relation between movie budget vs movie gross")
# line plot
y_predict = [round(w,2)*i/10 + round(b,2) for i in range(5)]
x_predict = [i/10 for i in range(5)]
# plot
plt.scatter(x_axis,y_axis,s=4)
plt.plot(x_predict,y_predict)
print(list(map(lambda x : round(x,2),y_predict)))
print(x_predict)
# show plot
plt.show()
print(round(w,2),round(b,2))
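Your training code fits w and b to predict Total_gross from budget (list_x is budget, list_y is Total_gross), but when plotting the results you assign x_axis to Total_gross and y_axis to budget. The scatter plot's axes are therefore swapped relative to the fitted line Total_gross = w * budget + b, which makes the line look like a poor fit even if training worked. Put budget on the x-axis and Total_gross on the y-axis, and swap the axis labels to match.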
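As a rough illustration of the corrected plot, here is a minimal sketch. The helper names `fitted_line` and `plot_fit` are made up for this example and are not part of your code; it assumes `w` and `b` come from your `update_w_b` call and that both columns are already scaled to billions as in step 1.

```python
def fitted_line(w, b, x_min, x_max, num_points=50):
    """Return (xs, ys) for the line y = w * x + b over [x_min, x_max].

    Hypothetical helper for illustration only.
    """
    step = (x_max - x_min) / (num_points - 1)
    xs = [x_min + k * step for k in range(num_points)]
    ys = [w * x + b for x in xs]
    return xs, ys


def plot_fit(budget, total_gross, w, b):
    """Scatter the data and overlay the fitted line on matching axes.

    Hypothetical helper for illustration only.
    """
    # Imported here so fitted_line() can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    # The model was trained as Total_gross = w * budget + b,
    # so budget goes on the x-axis and Total_gross on the y-axis.
    xs, ys = fitted_line(w, b, min(budget), max(budget))
    plt.scatter(budget, total_gross, s=4)
    plt.plot(xs, ys, color="red")
    plt.xlabel("budget (billions of $)")
    plt.ylabel("Total_gross (billions of $)")
    plt.title("Movie gross vs. movie budget")
    plt.show()
```

With your variables, the call would look like `plot_fit(list(df['budget']), list(df['Total_gross']), w, b)`. Drawing the line across the actual budget range, rather than a fixed 0 to 0.4, also keeps it aligned with the data regardless of scale.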