I'm working through some examples of linear regression under different scenarios, comparing the results from using Normalizer and StandardScaler, and the results are puzzling.
I'm using the Boston housing dataset and prepping it this way:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# load the data
boston = load_boston()
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df['PRICE'] = boston.target
I'm currently trying to reason about the results I get from the following scenarios:

- normalize=True vs. using Normalizer
- fit_intercept=False with and without standardization

Collectively, I find the results confusing.
Here's how I'm setting everything up:
# Prep the data
X = df.iloc[:, :-1]
y = df.iloc[:, -1:]
normal_X = Normalizer().fit_transform(X)
scaled_X = StandardScaler().fit_transform(X)
#now prepare some of the models
reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
reg3 = LinearRegression().fit(normal_X, y)
reg4 = LinearRegression().fit(scaled_X, y)
reg5 = LinearRegression(fit_intercept=False).fit(scaled_X, y)
Then, I created 3 separate dataframes to compare the R^2 scores, coefficient values, and predictions from each model.
To create the dataframe to compare coefficient values from each model, I did the following:
#Create a dataframe of the coefficients
coef = pd.DataFrame({
'coeff': reg1.coef_[0],
'coeff_normalize_true': reg2.coef_[0],
'coeff_normalizer': reg3.coef_[0],
'coeff_scaler': reg4.coef_[0],
'coeff_scaler_no_int': reg5.coef_[0]
})
Here's how I created the dataframe to compare the R^2 values from each model:
scores = pd.DataFrame({
'score': reg1.score(X, y),
'score_normalize_true': reg2.score(X, y),
'score_normalizer': reg3.score(normal_X, y),
'score_scaler': reg4.score(scaled_X, y),
'score_scaler_no_int': reg5.score(scaled_X, y)
}, index=range(1))
Lastly, here's the dataframe that compares the predictions from each:
predictions = pd.DataFrame({
'pred': reg1.predict(X).ravel(),
'pred_normalize_true': reg2.predict(X).ravel(),
'pred_normalizer': reg3.predict(normal_X).ravel(),
'pred_scaler': reg4.predict(scaled_X).ravel(),
'pred_scaler_no_int': reg5.predict(scaled_X).ravel()
}, index=range(len(y)))
Here are the resulting dataframes:

[screenshots of the coefficients, scores, and predictions dataframes]
I have three questions that I can't reconcile:
1. Why is there absolutely no difference between the model with normalize=True and the default one (normalize=False)? I can understand having predictions and R^2 values that are the same, but my features have different numerical scales, so I'm not sure why normalizing would have no effect at all. This is doubly confusing when you consider that using StandardScaler changes the coefficients considerably.

2. Why does the model using Normalizer have such radically different coefficient values from the others, especially when the model with LinearRegression(normalize=True) makes no change at all?

If you were to look at the documentation for each, it appears they're very similar if not identical.
From the docs on sklearn.linear_model.LinearRegression():
normalize : boolean, optional, default False
This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
Meanwhile, the docs on sklearn.preprocessing.Normalizer state that it normalizes to the l2 norm by default.
I don't see a difference between what these two options do, and I don't see why one would have such radical differences in coefficient values from the other.
3. The results from the model using StandardScaler are coherent to me, but I don't understand why the model using StandardScaler and setting fit_intercept=False performs so poorly.

From the docs on the Linear Regression module:
fit_intercept : boolean, optional, default True
whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
The StandardScaler centers your data, so I don't understand why using it with fit_intercept=False produces incoherent results.
Sklearn de-normalizes the coefficients behind the scenes after calculating them from the normalized input data (reference). This de-normalization is done so that, for test data, we can apply the coefficients directly and get predictions without normalizing the test data.

Hence, setting normalize=True does have an impact on the coefficients during fitting, but it doesn't affect the best-fit line, and the coefficients you see are rescaled back to the original feature scale.
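To make that concrete, here's a minimal sketch of the normalize-then-rescale step that older scikit-learn versions performed internally for normalize=True. The variable names (X_offset, X_scale, etc.) are my own, chosen to mirror that internal preprocessing; the data is synthetic:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3)) * [1.0, 10.0, 100.0]  # features on very different scales
y = X @ np.array([1.0, 2.0, 3.0]) + 5.0 + rng.normal(size=100)

# What normalize=True did internally: center X, divide each column by its l2 norm...
X_offset = X.mean(axis=0)
X_centered = X - X_offset
X_scale = np.linalg.norm(X_centered, axis=0)
coef_scaled = LinearRegression(fit_intercept=False).fit(X_centered / X_scale, y - y.mean()).coef_

# ...then de-normalize the coefficients and recompute the intercept
coef = coef_scaled / X_scale
intercept = y.mean() - X_offset @ coef

# The result matches a plain fit on the raw data, which is why normalize=True
# appears to "do nothing" in the final coef_ and predictions.
ref = LinearRegression().fit(X, y)
print(np.allclose(coef, ref.coef_), np.allclose(intercept, ref.intercept_))  # True True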
Normalizer, on the other hand, does the normalization with respect to each sample, meaning row-wise (you can see the reference code here): "Normalize samples individually to unit norm."

By contrast, normalize=True does the normalization with respect to each column/feature (reference).
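Here's a tiny sketch of that row-wise vs. column-wise difference (the toy matrix is just for illustration; note that normalize=True additionally centers the data before scaling, which plain normalize does not):

import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

row_normed = Normalizer().fit_transform(X)  # what Normalizer does: unit l2 norm per row
col_normed = normalize(X, axis=0)           # column-wise scaling, per feature

print(np.linalg.norm(row_normed, axis=1))   # [1. 1. 1.] -> every sample has unit norm
print(np.linalg.norm(col_normed, axis=0))   # [1. 1.]    -> every feature has unit norm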
Here is an example to illustrate the impact of normalization along the different dimensions of the data. Let us take two features x1 and x2, with y as the target variable; the target value is color-coded in the figure.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer, StandardScaler, normalize
n=50
x1 = np.random.normal(0, 2, size=n)
x2 = np.random.normal(0, 2, size=n)
noise = np.random.normal(0, 1, size=n)
y = 5 + 0.5*x1 + 2.5*x2 + noise
fig,ax=plt.subplots(1,4,figsize=(20,6))
ax[0].scatter(x1,x2,c=y)
ax[0].set_title('raw_data',size=15)
X = np.column_stack((x1,x2))
column_normalized=normalize(X, axis=0)
ax[1].scatter(column_normalized[:,0],column_normalized[:,1],c=y)
ax[1].set_title('column_normalized data',size=15)
row_normalized=Normalizer().fit_transform(X)
ax[2].scatter(row_normalized[:,0],row_normalized[:,1],c=y)
ax[2].set_title('row_normalized data',size=15)
standardized_data=StandardScaler().fit_transform(X)
ax[3].scatter(standardized_data[:,0],standardized_data[:,1],c=y)
ax[3].set_title('standardized data',size=15)
plt.subplots_adjust(left=0.3, bottom=None, right=0.9, top=None, wspace=0.3, hspace=None)
plt.show()
You can see that the best-fit line for the data in panels 1, 2 and 4 (raw, column-normalized, and standardized) would be the same, which means the R^2 score will not change due to column/feature normalization or standardization; you just end up with different coefficient values.

Note: the best-fit line for panel 3 (row-normalized data) would be different.
Predictions with the intercept forced to zero are expected to perform badly when the target variable is not scaled to zero mean. You can see a difference of 22.532 (the mean of the target) in every row, which shows the impact of the missing intercept on the output.
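To see why, here's a small sketch with synthetic data (so the exact numbers are illustrative): with standardized, zero-mean features, dropping the intercept leaves the coefficients unchanged but shifts every prediction by roughly the mean of y:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 22.5 + rng.normal(size=200)  # target mean far from 0

X_scaled = StandardScaler().fit_transform(X)
with_int = LinearRegression().fit(X_scaled, y)
no_int = LinearRegression(fit_intercept=False).fit(X_scaled, y)

# Because the scaled columns have zero mean, the coefficients agree...
print(np.allclose(with_int.coef_, no_int.coef_))                       # True
# ...but without an intercept the fitted line must pass through the origin,
# so every prediction is off by roughly mean(y):
print(np.mean(with_int.predict(X_scaled) - no_int.predict(X_scaled)))  # ~22.5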