Tags: python, scikit-learn, linear-regression, pipeline

How are Linear Regression coefficients stored in sklearn Pipelines?


I have been trying to understand the use of Sklearn Pipelines.

I run the following code to scale my data, fit a linear regression within a Pipeline, and plot the result:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(x_train, y_train)

xfit = np.linspace(0, 1.25, 50)  # evenly spaced inputs to draw the fitted line
yfit = pipe.predict(xfit[:, np.newaxis])
plt.scatter(x_train, y_train)
plt.plot(xfit, yfit, color='r')

[Figure: scatter plot of the training data with the regression line predicted by the pipeline]

However, dark magic seems to be involved when I plot the regression by hand, i.e. by reading the coefficients and intercept off the LinearRegression object stored in the Pipeline with the following code: the line they describe is not the same regression (coef + intercept) as the one the Pipeline uses (see graph).

print("Linear Regression intercept: ", pipe['linearregression'].intercept_)
print("Linear Regression coefficients: ", pipe['linearregression'].coef_)

The StandardScaler seems to be involved: removing it from the pipeline makes the coefficients printed by the code above match the plotted regression.

Where are the unnormalised regression coefficients and intercept stored in the pipeline object? Or equivalently, how can we compute them from the normalised regression coefficients and the standard scaler?


Solution

  • Any ideas on where the unnormalised regression coefficients and intercept are stored in the pipeline object?

    They are not, because the pipeline doesn't do anything besides string together the transformer(s) and model. And the model object only knows about the scaled input data.

    Or equivalently, how can we compute them from the normalised regression coefficients, using the standard scaler?

    StandardScaler has the attributes mean_ and scale_ (also var_), which contain the per-column means and standard deviations of the original data that are used to transform the data. So we have:

    y_hat = lr.coef_ * x_transformed + lr.intercept_
          = lr.coef_ * (x - scaler.mean_) / scaler.scale_ + lr.intercept_
          = (lr.coef_ / scaler.scale_) * x + (lr.intercept_ - lr.coef_ * scaler.mean_ / scaler.scale_)
    

    That is, your unnormalized regression coefficient is lr.coef_ / scaler.scale_ and the unnormalized intercept is lr.intercept_ - lr.coef_ * scaler.mean_ / scaler.scale_.

    (I haven't tested that, so do check that it makes sense.)
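    For example, here is a minimal sanity check of that derivation on synthetic data (the data and variable names below are made up for illustration; only StandardScaler's documented mean_ and scale_ attributes and the pipeline's named-step access are assumed):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1.25, size=(100, 1))              # one synthetic feature
    y = 3.0 * x.ravel() + 0.5 + rng.normal(0, 0.1, 100)  # noisy line y = 3x + 0.5

    pipe = make_pipeline(StandardScaler(), LinearRegression())
    pipe.fit(x, y)
    scaler = pipe['standardscaler']
    lr = pipe['linearregression']

    # Un-normalise the coefficients as derived above.
    coef = lr.coef_ / scaler.scale_
    intercept = lr.intercept_ - (lr.coef_ * scaler.mean_ / scaler.scale_).sum()

    # The hand-computed line should reproduce the pipeline's predictions.
    xfit = np.linspace(0, 1.25, 50)[:, np.newaxis]
    assert np.allclose(pipe.predict(xfit), xfit @ coef + intercept)

    Note the .sum() in the intercept: with more than one feature, lr.coef_ * scaler.mean_ / scaler.scale_ is a vector, and all of its components fold into the single intercept, while the x @ coef matrix product handles the per-feature terms.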