Hi, I am trying to do linear regression, and this is what happens when I try to run the code. The code is:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('train.csv')
    df = data.dropna()
    X = np.array(df.x)
    Y = np.array(df.y)

    def compute_error(x, y, m, b):
        error = 0
        for i in range(len(x)):
            error += (y[i] - (m*x[i] + b))**2
        return error / float(len(x))

    def step_graident_descent(x, y, m, b, alpha):
        N = float(len(x))
        b_graident = 0
        m_graident = 0
        for i in range(0, len(x)):
            x = X[i]
            y = Y[i]
            b_graident += (-2/N) * (y - (m*x + b))
            m_graident += (-2/N) * x*(y - (m*x + b))
        new_m = m - alpha*m_graident
        new_b = b - alpha*b_graident
        return new_m, new_b

    def graident_decsent(x, y, m, b, num_itters, alpha):
        for i in range(num_itters):
            m, b = step_graident_descent(x, y, m, b, alpha)
        return m, b

    def run():
        b = 0
        m = 0
        numberOfIttertions = 1000
        m, b = graident_decsent(X, Y, m, b, numberOfIttertions, 0.001)
        print(m, b)

    if __name__ == '__main__':
        run()
and the error that I get is:
    linearRegression.py:22: RuntimeWarning: overflow encountered in double_scalars
      m_graident += (-2/N) * x*(y-(m*x+b))
    linearRegression.py:21: RuntimeWarning: invalid value encountered in double_scalars
      b_graident +=(-2/N) * (y-(m*x+b))
    linearRegression.py:22: RuntimeWarning: invalid value encountered in double_scalars
      m_graident += (-2/N) * x*(y-(m*x+b))
If anyone can help me I would be so grateful, since I have been stuck on this for about two months. Thank you.
OK, so here is the minimal reproducible example that I was talking about. I replaced your X, Y with the following.
    n = 10**2
    X = np.linspace(0, 10**6, n)
    Y = 1.5*X + 0.2*10**6*np.random.normal(size=n)
If I then run:

    b = 0
    m = 0
    numberOfIttertions = 1000
    m, b = graident_decsent(X, Y, m, b, numberOfIttertions, 0.001)
I get exactly the problem you describe. The only surprising thing is the ease of the solution: I just replaced your alpha with 10**-14 and everything works fine.
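For completeness, here is a self-contained sketch of that reproduction. I have cleaned up the step function to use its x, y arguments instead of the globals X, Y (which your version silently falls back on), and added a fixed random seed; otherwise it follows your code:

```python
import numpy as np

def step_gradient_descent(x, y, m, b, alpha):
    # One gradient-descent step for mean-squared-error linear regression.
    N = float(len(x))
    b_gradient = 0.0
    m_gradient = 0.0
    for i in range(len(x)):
        b_gradient += (-2 / N) * (y[i] - (m * x[i] + b))
        m_gradient += (-2 / N) * x[i] * (y[i] - (m * x[i] + b))
    return m - alpha * m_gradient, b - alpha * b_gradient

def gradient_descent(x, y, m, b, num_iters, alpha):
    for _ in range(num_iters):
        m, b = step_gradient_descent(x, y, m, b, alpha)
    return m, b

np.random.seed(0)                     # fixed seed so the run is repeatable
n = 10**2
X = np.linspace(0, 10**6, n)
Y = 1.5 * X + 0.2 * 10**6 * np.random.normal(size=n)

# alpha = 0.001 overflows on data of this scale; alpha = 10**-14 converges.
m, b = gradient_descent(X, Y, 0.0, 0.0, 1000, 10**-14)
print(m, b)   # m ends up close to the true slope 1.5
```

The tiny alpha is needed because x is of order 10**6, so x**2 in the m-gradient is of order 10**12; the step size has to compensate for that scale.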
Your example is not reproducible, since we don't have train.csv. Generally, both for understanding your problem yourself and for getting concrete answers, it is very helpful to have a very small example that people can run and tinker with. E.g. maybe you can think of a much shorter input to your regression that also results in this error.
But now to your question. Your first RuntimeWarning, i.e.
    linearRegression.py:22: RuntimeWarning: overflow encountered in double_scalars
      m_graident += (-2/N) * x*(y-(m*x+b))
means that x, and hence m_graident, are of type numpy.double = numpy.float64. This datatype can store numbers in the range (-1.79769313486e+308, 1.79769313486e+308). Going beyond that range is called an overflow. E.g. np.double(1.79769313486e+308) is still OK, but if you multiply it by, say, 1.1, you get your favorite runtime warning. Notice that this is 'just' a warning and the code still runs. But it can't give you a number back, since the result would be too big; instead it gives you inf.
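A minimal sketch of exactly that (using np.errstate only to silence the warning for the demo):

```python
import numpy as np

big = np.double(1.79769313486e+308)   # close to the largest float64

with np.errstate(over="ignore"):      # suppress the RuntimeWarning for the demo
    too_big = big * np.double(1.1)    # exceeds the representable range

print(big)       # still a finite number
print(too_big)   # inf
```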
OK, but what does

    linearRegression.py:21: RuntimeWarning: invalid value encountered in double_scalars
      b_graident +=(-2/N) * (y-(m*x+b))

mean?
It comes from calculating with the infinity I just mentioned. Some calculations with infinity are valid:
    np.inf - 10**6  -> inf
    np.inf + 10**6  -> inf
    np.inf / 10**6  -> inf
    np.inf * 10**6  -> inf
    np.inf * (-10**6) -> -inf
    1 / np.inf      -> 0
    np.inf * np.inf -> inf
but some are not and give nan, i.e. "not a number":
    np.inf / np.inf
    np.inf - np.inf
These are called indeterminate forms in math, since what you would get out depends on how you arrived at the infinity. E.g.
    (np.double(1e+309) + np.double(1e+309)) - np.double(1e+309)
    np.double(1e+309) - (np.double(1e+309) + np.double(1e+309))
are both inf - inf, but you would expect different results.
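You can check that floating point cannot tell the two apart. A minimal sketch (note that np.double(1e309) already overflows to inf on conversion, since 1e309 is beyond the float64 range):

```python
import numpy as np

with np.errstate(invalid="ignore"):   # silence the 'invalid value' warning
    a = np.double(1e309)              # overflows to inf immediately
    r1 = (a + a) - a                  # inf - inf
    r2 = a - (a + a)                  # inf - inf

print(r1, r2)   # nan nan -- both indeterminate forms collapse to nan
```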
Getting a nan is unfortunate, since calculations with nan always yield nan, so you can't use your gradients anymore once a nan gets added in.
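A tiny sketch of how a single nan poisons every later update of an accumulator like your gradients:

```python
import numpy as np

m_gradient = np.nan                  # one bad step produced a nan
for _ in range(5):
    m_gradient += (-2 / 10.0) * 3.0  # every further accumulation stays nan

print(m_gradient)   # nan
```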
Another option is to use an existing implementation of linear regression, e.g. from scikit-learn. See
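For instance, a minimal sketch using scikit-learn's LinearRegression on the synthetic data from above (assuming scikit-learn is installed; note that it expects the features as a 2-D array, hence the reshape):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
n = 10**2
X = np.linspace(0, 10**6, n).reshape(-1, 1)   # shape (n, 1) for scikit-learn
Y = 1.5 * X.ravel() + 0.2 * 10**6 * np.random.normal(size=n)

model = LinearRegression().fit(X, Y)
print(model.coef_[0], model.intercept_)   # slope comes out close to 1.5
```

Besides avoiding the step-size tuning entirely, this solves the least-squares problem directly instead of by gradient descent, so the scale of X is not an issue.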