Consider the task of chaining PCA and a logistic regression, where PCA performs dimensionality reduction and the logistic regression does the prediction.
Example taken from the sklearn documentation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)
param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)
How can I perform dimensionality reduction only on a subset of my feature set using FunctionTransformer (for example, restrict PCA to the last ten columns of X_digits)?
You can first create a function (called last_ten_columns below) that returns the last 10 columns of the input X_digits. Then create a FunctionTransformer that wraps this function, and use it as the first step of the pipeline.
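As a quick illustration of what the transformer does on its own, here is a minimal sketch on a toy array. The array X_toy is an assumption made up for this example (it is not part of the digits data); it only mirrors the 64-column width of X_digits.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def last_ten_columns(X):
    # Keep only the final ten columns of a 2-D array.
    return X[:, -10:]


func_trans = FunctionTransformer(last_ten_columns)

# Hypothetical toy input: 3 rows and 64 columns, numbered 0..191.
X_toy = np.arange(3 * 64).reshape(3, 64)

X_last = func_trans.fit_transform(X_toy)
print(X_last.shape)  # (3, 10)
print(X_last[0])     # values 54..63, i.e. the last ten columns of row 0
```

Note that when this transformer is the first step of a pipeline, every later step receives only these ten columns.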
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import FunctionTransformer
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
def last_ten_columns(X):
    return X[:, -10:]
func_trans = FunctionTransformer(last_ten_columns)
pipe = Pipeline(steps=[('func_trans',func_trans), ('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)
param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)
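One caveat: with func_trans as the first step, the classifier only ever sees the PCA of the last ten columns, and the other 54 columns are dropped. If those columns should still reach the classifier unchanged, one possible variation (a sketch, not part of the original example) is to combine two branches with FeatureUnion. The helper all_but_last_ten_columns, the step names 'untouched', 'reduced' and 'features', and the choices n_components=5 and max_iter=1000 are assumptions made for this illustration.

```python
from sklearn import datasets, decomposition, linear_model
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def last_ten_columns(X):
    return X[:, -10:]


def all_but_last_ten_columns(X):
    # Hypothetical helper: everything except the final ten columns.
    return X[:, :-10]


# Branch 1: apply PCA to the last ten columns only.
reduced = Pipeline(steps=[
    ('func_trans', FunctionTransformer(last_ten_columns)),
    ('pca', decomposition.PCA(n_components=5)),
])

# Branch 2: pass the remaining columns through unchanged.
untouched = FunctionTransformer(all_but_last_ten_columns)

# Concatenate both branches column-wise.
features = FeatureUnion([('untouched', untouched), ('reduced', reduced)])

pipe = Pipeline(steps=[
    ('features', features),
    ('logistic', linear_model.LogisticRegression(max_iter=1000)),
])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

# 54 untouched columns plus 5 principal components.
X_combined = features.fit_transform(X_digits)
print(X_combined.shape)  # (1797, 59)

pipe.fit(X_digits, y_digits)
```

With this layout, the PCA parameter in a grid search would be addressed through the nested step names, for example features__reduced__pca__n_components.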