I have the following pipeline that combines preprocessing, feature selection, and an estimator:
## Selecting categorical and numeric features
numerical_ix = X.select_dtypes(include=np.number).columns
categorical_ix = X.select_dtypes(exclude=np.number).columns

## Create preprocessing pipelines for each datatype
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('encoder', OrdinalEncoder()),
    ('scaler', StandardScaler())])

## Putting the preprocessing steps together
preprocessor = ColumnTransformer([
    ('numerical', numerical_transformer, numerical_ix),
    ('categorical', categorical_transformer, categorical_ix)],
    remainder='passthrough')

## Create example pipeline with kNN
example_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(k=len(X.columns))),  # keep the same number of columns for now
    ('classifier', KNeighborsClassifier())
])

cross_val_score(example_pipe, X, y, cv=5, scoring='accuracy').mean()
I've written the following code that tries out different values of k for SelectKBest and plots the results. But how can I, at the same time, also find the optimal value of n_neighbors in the kNN classifier? I don't necessarily want to plot it, just find the optimal values. My guess would be GridSearchCV, but I don't know how to apply it to different steps in the pipeline.
k_range = list(range(1, len(X.columns)))  # 1 until 18
k_scores = []
for k in k_range:
    example_pipe = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('selector', SelectKBest(k=k)),  # try each value of k in turn
        ('classifier', KNeighborsClassifier())])
    score = cross_val_score(example_pipe, X, y, cv=5, scoring='accuracy').mean()
    k_scores.append(score)

plt.plot(k_range, k_scores)
plt.xlabel('Value of k in SelectKBest')
plt.xticks(k_range, rotation=20)
plt.ylabel('Cross-Validated Accuracy')
You are looking for the best n_neighbors value of KNeighborsClassifier. Your guess of using GridSearchCV for this purpose is right. If you want to understand its usage in conjunction with a pipeline, have a look at the documentation of Pipeline:
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’
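To see which `step__parameter` names a given pipeline exposes, you can list its `get_params()` keys. A quick illustration on a toy two-step pipeline (the step names here are chosen for the example):

```python
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([('selector', SelectKBest()),
                 ('classifier', KNeighborsClassifier())])

# Step name and parameter name are joined by '__'
nested_params = [p for p in pipe.get_params() if '__' in p]
print(nested_params)  # includes 'selector__k' and 'classifier__n_neighbors'
```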
In your case:
example_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', SelectKBest()),
    ('classifier', KNeighborsClassifier())
])

param_grid = {
    "selector__k": [5, 10, 15],
    "classifier__n_neighbors": [3, 5, 10]
}
gs = GridSearchCV(example_pipe, param_grid=param_grid)
gs.fit(X, y)
And then retrieve the best parameters with best_params_:
best_k = gs.best_params_['selector__k']
best_n_neighbors = gs.best_params_['classifier__n_neighbors']
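For reference, here is a self-contained sketch of the whole search on synthetic data, so you can run it end to end. The column names, grid values, and the toy dataset are illustrative only, not from your actual data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data with numeric and categorical columns (stands in for your X, y)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'num1': rng.normal(size=200),
    'num2': rng.normal(size=200),
    'cat1': rng.choice(['a', 'b', 'c'], size=200),
})
y = (X['num1'] + (X['cat1'] == 'a')) > 0.5

numerical_ix = X.select_dtypes(include=np.number).columns
categorical_ix = X.select_dtypes(exclude=np.number).columns

preprocessor = ColumnTransformer([
    ('numerical', Pipeline([('imputer', SimpleImputer(strategy='median')),
                            ('scaler', StandardScaler())]), numerical_ix),
    ('categorical', Pipeline([('encoder', OrdinalEncoder()),
                              ('scaler', StandardScaler())]), categorical_ix)])

pipe = Pipeline([('preprocessor', preprocessor),
                 ('selector', SelectKBest()),
                 ('classifier', KNeighborsClassifier())])

# Search both steps at once via the 'step__parameter' naming convention
param_grid = {'selector__k': [1, 2, 3],
              'classifier__n_neighbors': [3, 5]}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='accuracy')
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))
```

After fitting, `gs.best_score_` holds the best mean cross-validated accuracy, and `gs.best_estimator_` is the whole pipeline refit on all the data with the winning parameters, ready for `predict`.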