I'm trying to compare Linear Regression (Normal Equation) with SGD, but it looks like SGD is far off. Am I doing something wrong?
Here's my code:
import numpy as np
from scipy import stats

x = np.random.randint(100, size=1000)
y = x * 0.10
slope, intercept, r_value, p_value, std_err = stats.linregress(x=x, y=y)
print("slope is %f and intercept is %s" % (slope, intercept))
# slope is 0.100000 and intercept is 1.61435309565e-11
And here's my SGD:
from sklearn import linear_model

x = x.reshape(1000, 1)
clf = linear_model.SGDRegressor()
clf.fit(x, y, coef_init=0, intercept_init=0)
print(clf.intercept_)
print(clf.coef_)
# [ 1.46746270e+10]
# [ 3.14999003e+10]
I would have thought that the coef and intercept would be almost the same, since the data is linear.
When I tried to run this code, I got an overflow error; with feature values this large, the gradient updates blow up. I suspect you're having the same problem, but for some reason it's not throwing an error for you.
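One robust fix is to standardize the features before fitting. Here's a minimal sketch of that approach, assuming scikit-learn's StandardScaler and make_pipeline (these aren't in your original code):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x = np.random.randint(100, size=1000).reshape(-1, 1)
y = x.ravel() * 0.10

# Standardizing keeps the gradient updates bounded, so SGD converges
model = make_pipeline(StandardScaler(), SGDRegressor())
model.fit(x, y)

# The learned coefficient lives in the scaled space, so check a prediction instead
print(model.predict([[50]]))  # should be close to 5.0

This way you can keep the original integer features and let the pipeline apply the same scaling at predict time.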
Even just scaling down the features makes everything work as expected. Using scipy.stats.linregress:
>>> x = np.random.random(1000) * 10
>>> y = x * 0.10
>>> slope, intercept, r_value, p_value, std_err = stats.linregress(x=x, y=y)
>>> print("slope is %f and intercept is %s" % (slope,intercept))
slope is 0.100000 and intercept is -2.22044604925e-15
Using linear_model.SGDRegressor:
>>> clf = linear_model.SGDRegressor()
>>> clf.fit(x[:, None], y)
SGDRegressor(alpha=0.0001, epsilon=0.1, eta0=0.01, fit_intercept=True,
l1_ratio=0.15, learning_rate='invscaling', loss='squared_loss',
n_iter=5, penalty='l2', power_t=0.25, random_state=None,
shuffle=False, verbose=0, warm_start=False)
>>> print("slope is %f and intercept is %s" % (clf.coef_[0], clf.intercept_[0]))
slope is 0.099763 and intercept is 0.00163353754797
The value for slope is a little lower, but I'd guess that's because of the regularization.
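If you want to confirm that, you can disable the penalty and give SGD more iterations; the recovered slope should land much closer to 0.10. A quick sketch, assuming a recent scikit-learn where penalty=None is accepted (older releases spelled it penalty='none'):

import numpy as np
from sklearn.linear_model import SGDRegressor

x = np.random.random(1000) * 10
y = x * 0.10

# penalty=None disables the default L2 regularization
clf = SGDRegressor(penalty=None, max_iter=10000, tol=1e-6)
clf.fit(x[:, None], y)
print(clf.coef_[0], clf.intercept_[0])  # slope should now be very close to 0.10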