python, scikit-learn, pipeline, training-data, test-data

How to use sklearn's standard scaler with make_pipeline?


I am used to running sklearn's standard scaler the following way:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
scaled_X_train = scaler.transform(X_train)

Where X_train is an array containing the features in my training dataset.

I may then use the same scaler to scale the features in my test dataset X_test:

scaled_X_test = scaler.transform(X_test)
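
Here is a self-contained version of that workflow (the synthetic dataset is only there so the snippet runs end to end):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data, only so the snippet is runnable
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)    # statistics learned from the training set only
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)  # the same statistics, reused on the test set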

I know that I can also "bake" the scaler into the model, using sklearn's make_pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

clf = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100))

But then how do I use the scaler? Is it enough to call the model like I normally would, i.e.:

clf.fit(X_train, y_train)

And then:

y_pred = clf.predict(X_test)

?


Solution

  • Yes, that is correct. It's also a good idea to bake the preprocessing into a pipeline, to avoid the common pitfall of scaling the test and training datasets independently.

    When calling clf.fit(X_train, y_train), the pipeline fits the StandardScaler on X_train only; when you later call clf.predict(X_test), it reuses those fitted statistics to transform X_test before handing it to the classifier, as the sketch below demonstrates.
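
    As a quick sanity check (a sketch, not part of the original question; pipe mirrors clf but fixes random_state so the two forests are comparable), you can confirm that the pipeline reproduces the manual workflow exactly:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Pipeline version
    pipe = make_pipeline(StandardScaler(),
                         RandomForestClassifier(n_estimators=100, random_state=0))
    pipe.fit(X_train, y_train)
    pipe_pred = pipe.predict(X_test)

    # Manual version: fit the scaler on X_train, reuse it on X_test
    scaler = StandardScaler().fit(X_train)
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(scaler.transform(X_train), y_train)
    manual_pred = rf.predict(scaler.transform(X_test))

    assert np.array_equal(pipe_pred, manual_pred)  # identical predictions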

    See an example at the beginning of the "Common pitfalls and recommended practices" documentation.

    We recommend using a Pipeline, which makes it easier to chain transformations with estimators, and reduces the possibility of forgetting a transformation.
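
    One concrete payoff (a sketch under the same assumptions as above): when you pass the whole pipeline to cross_val_score, the scaler is re-fit on each training fold, so the held-out fold never leaks into the scaling statistics.

    from sklearn.model_selection import cross_val_score

    # The StandardScaler is re-fit inside every fold; only that fold's
    # training portion determines the scaling statistics
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(scores.mean())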

    So the fact that you don't "use" the scaler yourself is by design.

    With that said, if for some reason you wanted to access the scaler inside the pipeline independently, for example to inspect its values, you can do so:

    clf.fit(X_train, y_train)
    # steps is a list of (name, estimator) tuples:
    # steps[0] is the first step, and [1] picks the fitted
    # scaler object out of that (name, estimator) pair
    clf.steps[0][1].scale_
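
    Equivalently, since make_pipeline names each step after its lowercased class name, the same fitted scaler is reachable by name:

    # Same fitted object, looked up by its auto-generated step name
    clf.named_steps['standardscaler'].scale_

    # Pipelines also support dict-style indexing by step name
    clf['standardscaler'].mean_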