Including a Predictor in a Pipeline with Scikit-Learn

This is really more of a "why does this code work?" question.

I was working through a problem from a textbook. Specifically, the problem was to build a Pipeline with a data preparation phase (remove NA values, perform feature scaling, etc.) followed by a prediction phase, in which a predictor is trained on the transformed dataset and returns its predictions.

Here, we used the Support Vector Regressor class (sklearn.svm.SVR).

I tried some code of my own, but it didn't work, so I looked up the solution provided by the author of the textbook:

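from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# data_prep, x_train and y_train are defined elsewhere (not shown here)
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', data_prep),
    ('svm_reg', SVR(kernel='rbf', C=30000, gamma='scale'))
])

prepare_select_and_predict_pipeline.fit(x_train, y_train)

# Predict on the first four rows of the training set
some_data = x_train.iloc[:4]
print("Predictions for a subset of Training Set:",
      prepare_select_and_predict_pipeline.predict(some_data))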

I tried this code, and it does work as expected. How can it work properly? My main objections are:

We only call fit() on the pipeline, so where is the data actually being transformed? We never call transform() anywhere.
Also, how can we call predict() on this pipeline? SVR may be part of the pipeline, but so are the other transformers, and they don't have a predict() method.

Thanks in advance for your answers!

Solution

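When you call fit on the Pipeline, scikit-learn calls fit_transform on each preprocessing step in turn, passing the transformed output along, and then calls fit on the last step (the classifier or regressor) using the transformed data. When you call predict on the Pipeline, scikit-learn calls transform (not fit_transform) on the preprocessing steps and then predict on the last step. That is why you never call transform() yourself, and why the pipeline has a predict() method even though the transformers do not: the Pipeline exposes the prediction methods of its final step.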
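To make this concrete, here is a minimal sketch comparing a pipeline with the same steps done by hand. It does not use the textbook's data_prep or dataset: the synthetic data, the StandardScaler preprocessing step, and the value C=100 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption; not the textbook's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

# Pipeline version: StandardScaler stands in for the data_prep step
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_reg", SVR(kernel="rbf", C=100, gamma="scale")),
])
pipe.fit(X, y)                       # fit_transform on scaler, then fit on SVR
pipeline_pred = pipe.predict(X[:4])  # transform on scaler, then predict on SVR

# Manual version: the same calls written out explicitly
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # what fit() does for the preprocessing step
svr = SVR(kernel="rbf", C=100, gamma="scale")
svr.fit(X_scaled, y)                 # what fit() does for the last step
manual_pred = svr.predict(scaler.transform(X[:4]))  # what predict() does

# Both routes should give the same predictions
print(np.allclose(pipeline_pred, manual_pred))
```

Note that at prediction time the scaler is only asked to transform, so it reuses the mean and variance learned from the training data rather than refitting on the new rows.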

The "model", then, is not just the last step but every step that takes in data and outputs results: the Pipeline as a whole is the model. Likewise, if you wrap a Pipeline (preprocessing steps plus a final regressor or classifier) in GridSearchCV, the GridSearchCV object is now the model.
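The GridSearchCV case can be sketched the same way. Again, the synthetic data, the StandardScaler step, and the small grid of C values are assumptions for illustration, not the textbook's setup. Parameters of a pipeline step are addressed as step name, two underscores, then the parameter name.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption; not the textbook's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_reg", SVR(kernel="rbf", gamma="scale")),
])

# "svm_reg__C" targets the C parameter of the step named "svm_reg"
param_grid = {"svm_reg__C": [1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)  # refits the best pipeline on all of X, y (refit=True by default)

# The search object itself now behaves like a model
search_pred = search.predict(X[:4])
print(search.best_params_)
print(search_pred)
```

Because the whole pipeline is refit inside each cross-validation split, the scaler only ever learns its statistics from that split's training portion, which is one practical reason to put preprocessing inside the pipeline rather than in front of it.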

See the scikit-learn Pipeline documentation.