Tags: scikit-learn, feature-engineering, mlops

How to pass only necessary features to pipeline after SelectKBest


I have a regular tabular dataset; 100 features are pulled in from the database.

I want to feed it into a regular sklearn.pipeline that does preprocessing, encoding, some custom transformers, etc.

The penultimate step would be SelectKBest(k=10).

So the model itself needs only 10 features, yet the fitted pipeline will still require all 100 features as input.

In production I would like to pass only the "necessary" features and avoid gathering the extra ones, to reduce computation time.

Of course I could rebuild the pipeline, but the whole point of sklearn pipelines is to avoid exactly that. I'm also not sure how "standard" such a practice is.

I understand why it doesn't just work: earlier steps (encoders and the like) can expand the original columns, so e.g. 150 features may actually reach the SelectKBest input, and then it is not obvious how to map the selected features back to the original input columns.

Perhaps there are other tools that handle this kind of thing out of the box?

Basic example:

from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

X = X.iloc[:, :10]  # diabetes has exactly 10 feature columns; stand-in for the "100 from the database"

pipeline = Pipeline([
    ('scaler', StandardScaler()), 
    ('feature_selection', SelectKBest(score_func=f_regression, k=4)),
    ('model', LinearRegression())
])

pipeline.fit(X, y)

selected_features = pipeline.named_steps['feature_selection'].get_support()
selected_features = X.columns[selected_features]
print(f"Selected features: {selected_features}")
# Selected features: Index(['bmi', 'bp', 's4', 's5'], dtype='object')

prod_data = X[selected_features]

pred = pipeline.predict(prod_data)

# Here will be an Exception
# ValueError: The feature names should match those that were passed during fit.
# Feature names seen at fit time, yet now missing:
# - age
# - s1
# - s2
# - s3
# - s6
# - ...

Solution

  • I'd generally suggest the solution you're trying to avoid: rebuild the pipeline without the selection step and refit it on the training set with the removed columns dropped (see the first sketch below).

    It may be possible to identify and change the fitted attributes of each pipeline step (drop the corresponding entries from a scaler's mean_ and scale_, shrink n_features_in_ and feature_names_in_, ...), and with care that could be automated (second sketch below). But messing with internals is risky: removing the wrong entry can produce no error while silently applying the wrong scaling to a column.

    Another low-tech option: you don't care what the values of the removed columns are for a prediction row, so just make them up (third sketch below). The pipeline will still process the fake values, but you no longer need to gather those fields in your production environment.
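
    A minimal sketch of the first option, reusing the objects from the example above. The production pipeline's step list here is illustrative; keep whatever preprocessing your real pipeline has, minus the selection step:

    # Fit the full pipeline once, offline, to learn which columns survive selection
    pipeline.fit(X, y)
    mask = pipeline.named_steps['feature_selection'].get_support()
    selected_features = X.columns[mask]

    # Rebuild a production pipeline without the selection step and refit it
    # on the selected columns only
    prod_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', LinearRegression()),
    ])
    prod_pipeline.fit(X[selected_features], y)

    # In production only the selected columns need to be gathered
    pred = prod_pipeline.predict(X[selected_features])

    For purely column-wise transformers like StandardScaler this refit reproduces the original model exactly; if earlier steps combine columns (imputation across features, PCA, ...), the refit model can differ slightly.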
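
    For completeness, roughly what the second (risky) option looks like for this particular pipeline. Attribute names and validation behaviour vary across scikit-learn versions, so treat this strictly as an illustration (it assumes a recent 1.x release, a pipeline fit on a DataFrame, and column-wise steps):

    scaler = pipeline.named_steps['scaler']
    select = pipeline.named_steps['feature_selection']
    mask = select.get_support()

    # Shrink the fitted scaler to the selected columns
    scaler.mean_ = scaler.mean_[mask]
    scaler.var_ = scaler.var_[mask]
    scaler.scale_ = scaler.scale_[mask]
    scaler.n_features_in_ = int(mask.sum())
    scaler.feature_names_in_ = scaler.feature_names_in_[mask]

    # Shrink the selector so it now passes every remaining column through
    # (k=4 of the 4 remaining scores keeps them all, in the original order)
    select.scores_ = select.scores_[mask]
    select.pvalues_ = select.pvalues_[mask]
    select.n_features_in_ = int(mask.sum())

    # The patched pipeline now accepts only the selected columns
    pred = pipeline.predict(X[selected_features])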
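
    Finally, a sketch of the low-tech option, using the example's originally fitted full pipeline (not the patched one above). The to_full_frame helper and the zero filler are made up for illustration; this only gives the same predictions because the steps before SelectKBest treat columns independently, so the fake values never leak into the selected features:

    import pandas as pd

    def to_full_frame(partial: pd.DataFrame, all_columns) -> pd.DataFrame:
        """Pad the gathered columns with dummy values for the columns the selector discards."""
        return partial.reindex(columns=all_columns, fill_value=0.0)

    prod_rows = X[selected_features].head()    # only the columns production actually gathers
    pred = pipeline.predict(to_full_frame(prod_rows, X.columns))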