sklearn pipeline with PCA on feature subset using FunctionTransformer

Consider the task of chaining PCA and logistic regression in a pipeline, where PCA performs dimensionality reduction and logistic regression does the prediction.

Example taken from the sklearn documentation:

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)

param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)

How can I perform dimensionality reduction only on a subset of my feature set using FunctionTransformer (for example, restrict PCA to the last ten columns of X_digits)?

Solution

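You can first create a function (called last_ten_columns below) that returns the last 10 columns of its input. Wrap that function in a FunctionTransformer and use it as the first step of the pipeline, ahead of the PCA step. Note that the columns the function does not return are dropped, so PCA and the logistic regression only ever see those last 10 columns.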
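Before wiring the transformer into the pipeline, you can check the selection step on its own. The snippet below is only a small sketch: X_demo is a made-up array (not the digits data) with the same number of columns, used to confirm that the transformer passes through exactly the last ten columns.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def last_ten_columns(X):
    # Keep only the last 10 columns; all other columns are dropped.
    return X[:, -10:]


# Hypothetical toy input: 3 samples, 64 features (same width as the digits data).
X_demo = np.arange(3 * 64, dtype=float).reshape(3, 64)

func_trans = FunctionTransformer(last_ten_columns)
X_selected = func_trans.fit_transform(X_demo)

print(X_selected.shape)  # (3, 10)
```

The full pipeline follows.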

import numpy as np

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import FunctionTransformer

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()

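def last_ten_columns(X):
    # Keep only the last 10 columns; the remaining columns are dropped.
    return X[:, -10:]

# Wrap the function so it can be used as a pipeline step.
func_trans = FunctionTransformer(last_ten_columns)

pipe = Pipeline(steps=[('func_trans', func_trans), ('pca', pca), ('logistic', logistic)])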

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)

param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)
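The pipeline above feeds only the last ten columns to the classifier. If what you want is to reduce those ten columns with PCA and keep the other columns unchanged, one possible alternative is ColumnTransformer with remainder='passthrough'. This swaps FunctionTransformer for a different tool, so treat it as a sketch of a variation on this answer. The step name 'reduce' and the variable names are my own choices.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = load_digits(return_X_y=True)
n_features = X_digits.shape[1]

# Integer positions of the last 10 columns.
last_ten = list(range(n_features - 10, n_features))

# Apply PCA to the last 10 columns only; pass the other columns through as-is.
reduce_last_ten = ColumnTransformer(
    transformers=[('pca', PCA(), last_ten)],
    remainder='passthrough',
)

pipe = Pipeline(steps=[
    ('reduce', reduce_last_ten),
    ('logistic', LogisticRegression()),
])

# Nested parameters are addressed as <step>__<transformer>__<param>.
param_grid = {
    'reduce__pca__n_components': [5, 10],
    'logistic__C': np.logspace(-4, 4, 3),
}

estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)

print(estimator.best_params_)
```

With this layout the classifier receives the PCA components followed by the untouched columns, so the width of its input is the chosen n_components plus the number of passed-through columns.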