
Why is my SGD so far off from my linear regression model?


I'm trying to compare Linear Regression (normal equation) with SGD, but SGD seems to be way off. Am I doing something wrong?

Here's my code

import numpy as np
from scipy import stats

x = np.random.randint(100, size=1000)
y = x * 0.10
slope, intercept, r_value, p_value, std_err = stats.linregress(x=x, y=y)
print("slope is %f and intercept is %s" % (slope,intercept))
#slope is 0.100000 and intercept is 1.61435309565e-11

And here's my SGD

from sklearn import linear_model

x = x.reshape(1000, 1)
clf = linear_model.SGDRegressor()
clf.fit(x, y, coef_init=0, intercept_init=0)

print(clf.intercept_)
print(clf.coef_)

#[  1.46746270e+10]
#[  3.14999003e+10]

I would have thought that the coef and intercept would be almost the same as the data is linear.


Solution

  • When I tried to run this code, I got an overflow error. I suspect you're having the same problem, but for some reason, it's not throwing an error.

    If you scale down the features, everything works as expected. Using scipy.stats.linregress:

    >>> x = np.random.random(1000) * 10
    >>> y = x * 0.10
    >>> slope, intercept, r_value, p_value, std_err = stats.linregress(x=x, y=y)
    >>> print("slope is %f and intercept is %s" % (slope,intercept))
    slope is 0.100000 and intercept is -2.22044604925e-15
    

    Using linear_model.SGDRegressor:

    >>> clf.fit(x[:,None], y)
    SGDRegressor(alpha=0.0001, epsilon=0.1, eta0=0.01, fit_intercept=True,
           l1_ratio=0.15, learning_rate='invscaling', loss='squared_loss',
           n_iter=5, penalty='l2', power_t=0.25, random_state=None,
           shuffle=False, verbose=0, warm_start=False)
    >>> print("slope is %f and intercept is %s" % (clf.coef_, clf.intercept_[0]))
    slope is 0.099763 and intercept is 0.00163353754797
    

    The value for the slope is a little lower, but I'd guess that's because of the L2 regularization (`penalty='l2'`, the default), which shrinks the coefficients slightly toward zero.
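
    If you want to keep the original large-valued features, a common alternative to scaling them by hand is to standardize them first. This is a minimal sketch using scikit-learn's `StandardScaler` (not part of the original question) and then mapping the fitted coefficient back to the original scale:

    ```python
    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    x = rng.randint(100, size=1000).reshape(-1, 1).astype(float)
    y = (x * 0.10).ravel()

    # Standardize to zero mean, unit variance so SGD doesn't diverge
    scaler = StandardScaler()
    x_scaled = scaler.fit_transform(x)

    clf = SGDRegressor(random_state=0)
    clf.fit(x_scaled, y)

    # Undo the scaling to recover the slope on the original feature scale
    slope = clf.coef_[0] / scaler.scale_[0]
    print("slope is %f" % slope)  # close to 0.10
    ```

    The same idea applies to any gradient-based learner: standardizing keeps the gradient steps well-conditioned, and you can always translate the coefficients back afterwards.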