How to manually select features for Scikit-Learn model regression?


Scikit-learn offers several methods for automated feature selection, for example:

from sklearn.feature_selection import SelectKBest, f_regression

my_feature_selector = SelectKBest(score_func=f_regression, k=3)
my_feature_selector.fit_transform(X, y)

The selected features are then retrievable using

feature_idx = my_feature_selector.get_support(indices=True)
feature_names = X.columns[feature_idx]

(Note: in my case X and y are pandas DataFrames with named columns.)
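
For reference, here is a minimal, self-contained sketch of the automated approach described above. The data is synthetic and purely an assumption for illustration (column names `a` to `e`, with `y` built from `a`, `b` and `c` only, and `y` as a Series rather than a DataFrame); only the `SelectKBest` and `get_support` calls come from the snippets above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): y depends only on a, b, c.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=["a", "b", "c", "d", "e"])
y = 2 * X["a"] + 2 * X["b"] + 2 * X["c"] + rng.normal(scale=0.1, size=500)

# Keep the 3 features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)  # NumPy array with 3 columns

# Map the selected column indices back to the DataFrame column names.
feature_idx = selector.get_support(indices=True)
feature_names = list(X.columns[feature_idx])
print(feature_names)
```

The selection here is driven entirely by the data, which is exactly what a pre-specified feature list is meant to avoid.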

The names of the features a model was fitted on are also saved as an attribute of the fitted model:

feature_names = my_model.feature_names_in_

However, I want to build a pipeline with a manual (i.e. pre-specified) set of features.

Obviously, I could manually select the features from the full dataset every time I train or predict:

model1_feature_names = ['MedInc', 'AveRooms', 'Latitude']
model1.fit(X[model1_feature_names], y)
y_pred1 = model1.predict(X[model1.feature_names_in_])

But I want a more convenient way to construct different models (or pipelines), each of which uses a potentially different, manually specified set of features. Ideally, I would pass the feature names as an initialization parameter (the equivalent of feature_names_in_), so that I don't have to subset the data myself and can run each model (or pipeline) on the full dataset:

model1.fit(X, y)  # uses a pre-defined subset of the features in X
model2.fit(X, y)  # uses a different subset of features
y_pred1 = model1.predict(X)
y_pred2 = model2.predict(X)

Do I need to build a custom feature selector to do this? Surely there's an easier way.

I guess I was expecting to find something like a built-in FeatureSelector class that I could use in a pipeline as follows:

my_feature_selector1 = FeatureSelector(feature_names=['MedInc', 'AveRooms', 'Latitude'])
my_feature_selector1.fit_transform(X, y)  # fit would learn nothing; transform would only select the named columns

pipe1 = Pipeline([('feature_selector', my_feature_selector1), ('model', LinearRegression())])

Solution

  • You can use ColumnTransformer for column selection. Here, you pass through the columns you want to keep and drop the rest:

    from sklearn.compose import ColumnTransformer

    my_feature_selector1 = ColumnTransformer([
        ("selector", "passthrough", ['MedInc', 'AveRooms', 'Latitude'])
    ], remainder="drop")

    Note that ColumnTransformer supports both name-based and index-based column selection lists. Name-based selection requires X to be a pandas DataFrame.
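
To show how this fits the workflow asked for in the question, here is a rough sketch that wraps the selector and a LinearRegression in a Pipeline, so each model can be fitted and used on the full X. The `make_model` helper and the synthetic stand-in data (including the extra column names and the second feature list) are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


def make_model(feature_names):
    """Hypothetical helper: a pipeline that keeps only `feature_names`."""
    selector = ColumnTransformer(
        [("selector", "passthrough", feature_names)],
        remainder="drop",  # drop every column that is not listed
    )
    return Pipeline([("feature_selector", selector), ("model", LinearRegression())])


# Synthetic stand-in data (an assumption); any DataFrame with these columns works.
rng = np.random.default_rng(0)
columns = ["MedInc", "HouseAge", "AveRooms", "Latitude", "Longitude"]
X = pd.DataFrame(rng.normal(size=(100, 5)), columns=columns)
y = X["MedInc"] - 0.5 * X["Latitude"] + rng.normal(scale=0.1, size=100)

model1 = make_model(["MedInc", "AveRooms", "Latitude"])
model2 = make_model(["MedInc", "HouseAge"])

# Both pipelines receive the full DataFrame and select their own columns.
model1.fit(X, y)
model2.fit(X, y)
y_pred1 = model1.predict(X)
y_pred2 = model2.predict(X)

# Each regressor only ever saw its own subset of columns.
n_coefs1 = model1.named_steps["model"].coef_.shape[0]
n_coefs2 = model2.named_steps["model"].coef_.shape[0]
```

Because the selection lives inside the pipeline, the same feature list is applied at both fit and predict time, so there is no need to subset X by hand.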