sklearn pipeline with PCA on feature subset using FunctionTransformer


Consider the task of chaining PCA and logistic regression in a pipeline, where PCA performs the dimensionality reduction and logistic regression does the prediction.

Example taken from the sklearn documentation:

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)

param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)

How can I perform dimensionality reduction only on a subset of my feature set using FunctionTransformer (for example, restrict PCA to the last ten columns of X_digits)?


Solution

  • First write a function (called last_ten_columns below) that returns the last 10 columns of its input. Wrap that function in a FunctionTransformer and use it as the first step of the pipeline. Note that every later step then sees only those ten columns; the remaining features are dropped.

    import numpy as np
    import matplotlib.pyplot as plt
    
    from sklearn import linear_model, decomposition, datasets
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import FunctionTransformer
    
    logistic = linear_model.LogisticRegression()
    
    pca = decomposition.PCA()
    
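    def last_ten_columns(X):
        # Keep only the last ten columns; all other features are dropped.
        return X[:, -10:]
    
    # Wrap the function so it can be used as a pipeline step.
    func_trans = FunctionTransformer(last_ten_columns)
    
    # Column selection runs first, so PCA and the classifier see ten columns.
    pipe = Pipeline(steps=[('func_trans', func_trans), ('pca', pca), ('logistic', logistic)])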
    
    digits = datasets.load_digits()
    X_digits = digits.data
    y_digits = digits.target
    
    n_components = [5, 10]
    Cs = np.logspace(-4, 4, 3)
    
    param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
    estimator = GridSearchCV(pipe, param_grid)
    estimator.fit(X_digits, y_digits)
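
The pipeline above feeds only the last ten columns to PCA and the classifier. If the goal is instead to reduce those ten columns while keeping the other 54 features, one option is sketched below. It is not part of the original answer, and it swaps FunctionTransformer for sklearn.compose.ColumnTransformer. The step names col_trans and pca, the fixed n_components=5, and max_iter=1000 are illustrative assumptions.

```python
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target

# Explicit integer indices of the last ten columns.
n_features = X_digits.shape[1]
last_ten = list(range(n_features - 10, n_features))

# Apply PCA to the last ten columns only; pass the others through unchanged.
# n_components=5 is an arbitrary choice for illustration.
col_trans = ColumnTransformer(
    transformers=[("pca", PCA(n_components=5), last_ten)],
    remainder="passthrough",
)

# max_iter is raised here as an assumption to give the solver more room.
pipe = Pipeline(steps=[
    ("col_trans", col_trans),
    ("logistic", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_digits, y_digits)

# The transformed matrix holds the 5 PCA components followed by
# the 54 untouched columns.
X_reduced = col_trans.fit_transform(X_digits)
print(X_reduced.shape)
```

With this layout, the grid-search key for the number of components would be col_trans__pca__n_components rather than pca__n_components.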