Tags: python, pandas, numpy, machine-learning, linear-regression

Overflow error encountered in double scalars


Hi, I am trying to do linear regression, and this is what happens when I try to run the code. The code is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('train.csv')
df = data.dropna()
X = np.array(df.x)
Y = np.array(df.y)
def compute_error(x,y,m,b):
    error = 0
    for i in range(len(x)):
        error+= (y[i]-(m*x[i]+b))**2
    return error / float(len(x))

def step_graident_descent(x,y,m,b , alpha):
    N = float(len(x))
    b_graident = 0
    m_graident = 0
    for i in range(0 , len(x)):
        x = X[i]
        y = Y[i]
        b_graident +=(-2/N) * (y-(m*x+b))
        m_graident += (-2/N) * x*(y-(m*x+b))
    new_m = m - alpha*m_graident
    new_b = b - alpha*b_graident

    return new_m , new_b
def graident_decsent(x,y,m,b,num_itters,alpha):
    for i in range(num_itters):
        m,b = step_graident_descent(x,y,m,b,alpha)
    return m,b
def run():
    b=0
    m=0
    numberOfIttertions = 1000
    m,b = graident_decsent(X , Y ,m,b,numberOfIttertions , 0.001)
    print(m,b)
    
if __name__ == '__main__':
    run()   

and the error that I get is:

    linearRegression.py:22: RuntimeWarning: overflow encountered in double_scalars
  m_graident += (-2/N) * x*(y-(m*x+b))
linearRegression.py:21: RuntimeWarning: invalid value encountered in double_scalars
  b_graident +=(-2/N) * (y-(m*x+b))
linearRegression.py:22: RuntimeWarning: invalid value encountered in double_scalars
  m_graident += (-2/N) * x*(y-(m*x+b))

If anyone can help me I would be so grateful, since I have been stuck on this for about two months. Thank you!


Solution

  • Edit: tl;dr Solution

    Ok so here is the minimal reproducible example that I was talking about. I replaced your X, Y with the following.

    n = 10**2
    X = np.linspace(0,10**6,n)
    Y = 1.5*X+0.2*10**6*np.random.normal(size=n)
    

    If I then run

    b=0
    m=0
    numberOfIttertions = 1000
    m,b = graident_decsent(X , Y ,m,b,numberOfIttertions , 0.001)
    

    I get exactly the problem you describe. The only surprising thing is how easy the solution is: I just replaced your alpha with 10**-14 and everything works fine.
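    For reference, here is a self-contained sketch of the whole fix. The update step is a condensed, vectorized stand-in for your step_graident_descent, not your exact code:

    import numpy as np

    # Synthetic data standing in for train.csv: y = 1.5*x plus noise.
    n = 10**2
    X = np.linspace(0, 10**6, n)
    Y = 1.5*X + 0.2*10**6*np.random.normal(size=n)

    def step(m, b, x, y, alpha):
        # One gradient descent step on the mean squared error.
        N = float(len(x))
        residual = y - (m*x + b)
        m_grad = np.sum((-2/N) * x * residual)
        b_grad = np.sum((-2/N) * residual)
        return m - alpha*m_grad, b - alpha*b_grad

    m, b = 0.0, 0.0
    for _ in range(1000):
        m, b = step(m, b, X, Y, alpha=10**-14)  # the small alpha is the fix
    print(m, b)  # m ends up close to 1.5; b barely moves at this scale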

    Why and how to give a Minimal, Reproducible Example

    Your example is not reproducible since we don't have train.csv. Generally, both for understanding your problem yourself and for getting concrete answers, it is very helpful to have a very small example that people can run and tinker with. E.g. maybe you can think of a much shorter input to your regression that also results in this error.

    The first RuntimeWarning

    But now to your question. Your first RuntimeWarning, i.e.

        linearRegression.py:22: RuntimeWarning: overflow encountered in double_scalars
      m_graident += (-2/N) * x*(y-(m*x+b))
    

    means that x, and hence m_graident, are of type numpy.double = numpy.float64. This datatype can store numbers in the range (-1.79769313486e+308, 1.79769313486e+308). If you go bigger or smaller than that, it is called an overflow. E.g. np.double(1.79769313486e+308) is still ok, but if you multiply it by, say, 1.1 you get your favorite runtime warning. Notice that this is 'just' a warning and the code still runs. But since the result would be too big to store, it gives you inf instead.
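    You can see the limit for yourself with a quick sketch (the exact warning text varies between NumPy versions):

    import numpy as np

    big = np.double(1.79769313486e+308)  # just below the float64 maximum
    print(big)                      # fine, prints the number
    print(np.finfo(np.double).max)  # the exact limit: 1.7976931348623157e+308
    print(big * 1.1)                # RuntimeWarning: overflow ... -> inf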

    The other RuntimeWarnings

    Ok, but what does

    linearRegression.py:21: RuntimeWarning: invalid value encountered in double_scalars
      b_graident +=(-2/N) * (y-(m*x+b))
    

    mean?

    It comes from calculating with the infinity that I just mentioned. Some calculations with infinity are valid:

    np.inf-10**6 -> inf
    np.inf+10**6 -> inf
    np.inf/10**6 -> inf
    np.inf*10**6 -> inf
    np.inf*(-10**6) -> -inf
    1/np.inf -> 0
    np.inf *np.inf -> inf
    

    but some are not and give nan, i.e. not a number:

    np.inf/np.inf 
    np.inf-np.inf 
    

    These are called indeterminate forms in math, since what you get out depends on how you got to the infinity. E.g.

    (np.double(1e+309)+np.double(1e+309))-np.double(1e+309)
    np.double(1e+309)-(np.double(1e+309)+np.double(1e+309))
    

    are both inf-inf, but you would expect different results. Getting a nan is unfortunate since calculations with nan always yield nan. And you can't use your gradients anymore once you add a nan.
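    A small sketch (not part of the original code) showing the propagation, and how np.isnan/np.isfinite can detect that your gradients have blown up:

    import numpy as np

    g = np.inf - np.inf      # an indeterminate form -> nan
    print(g)                 # nan
    print(g + 1.0, g * 0.0)  # nan propagates through every further operation
    print(np.isnan(g))       # True: detects nan specifically
    print(np.isfinite(g))    # False: catches both nan and +/-inf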

    Other resources

    Another option is to use an existing implementation of linear regression, e.g. LinearRegression from scikit-learn.
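    As a minimal sketch, fitting the synthetic data from above with scikit-learn (which solves the least-squares problem directly, so there is no learning rate to tune):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Same synthetic data as in the minimal example above.
    n = 10**2
    X = np.linspace(0, 10**6, n)
    Y = 1.5*X + 0.2*10**6*np.random.normal(size=n)

    # scikit-learn expects a 2D feature matrix, hence the reshape.
    reg = LinearRegression().fit(X.reshape(-1, 1), Y)
    print(reg.coef_[0], reg.intercept_)  # slope close to 1.5, intercept near 0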