I do not understand how to apply step_pca to preprocess my data

I am trying to understand how to apply step_pca to preprocess my data. Suppose I want to fit a K-Nearest Neighbor classifier to the iris dataset. For the sake of simplicity, I will not split the original iris dataset into train and test sets. I will assume iris is the training set and that I have some other observations as my test set.

I want to apply three transformations to the predictors in my train dataset:

  1. Center all predictor variables
  2. Scale all predictor variables
  3. Apply PCA to all predictor variables and keep enough components to explain at least 80% of the variance in my data
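
For reference, the three transformations can be sketched in base R. This is only an illustration of what the steps compute, assuming prcomp() as the PCA routine; the object names (predictors, pca, n_keep, scores) are made up for the example.

```r
# Numeric predictors only (drop the outcome column)
predictors <- iris[, setdiff(names(iris), "Species")]

# Steps 1 and 2: center and scale; step 3: PCA on the standardized data
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 80%
n_keep <- which(cum_explained >= 0.8)[1]

# Scores for the retained components (one row per observation)
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
head(scores)
```

On iris, the first two components already pass the 80% mark, which is why the recipe output further down contains only PC1 and PC2.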

So this is what I have:


library(tidymodels)

iris_rec <- 
  recipe(Species ~ ., 
         data = iris) %>%
  # center/scale
  step_center(-Species) %>%
  step_scale(-Species) %>%
  # pca
  step_pca(-Species, threshold = 0.8) %>%
  # apply data transformation
  prep()

iris_rec

#> Recipe
#> Inputs:
#>       role #variables
#>    outcome          1
#>  predictor          4
#> Training data contained 150 data points and no missing data.
#> Operations:
#> Centering for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> Scaling for Sepal.Length, Sepal.Width, Petal.Length, Petal.... [trained]
#> PCA extraction with Sepal.Length, Sepal.Width, Petal.Length, Petal.W... [trained]

Ok, so far, so good. All the transformations are applied to my dataset. When I prepare my train dataset using juice, everything goes as expected:

# transformed training set

iris_train_t <- juice(iris_rec)
iris_train_t

#> # A tibble: 150 × 3
#>    Species   PC1     PC2
#>    <fct>   <dbl>   <dbl>
#>  1 setosa  -2.26 -0.478 
#>  2 setosa  -2.07  0.672 
#>  3 setosa  -2.36  0.341 
#>  4 setosa  -2.29  0.595 
#>  5 setosa  -2.38 -0.645 
#>  6 setosa  -2.07 -1.48  
#>  7 setosa  -2.44 -0.0475
#>  8 setosa  -2.23 -0.222 
#>  9 setosa  -2.33  1.11  
#> 10 setosa  -2.18  0.467 
#> # … with 140 more rows

So, I have two predictors based on PCA (PC1 and PC2) and my response variable. However, when I proceed with my modelling, I get an error: all the models I test fail, as you can see below:

# cross validation


iris_train_cv <- vfold_cv(iris_train_t, v = 5)

# tuning

iris_knn_tune <-
  nearest_neighbor(
    neighbors = tune(),
    weight_func = tune(),
    dist_power = tune()
  ) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# grid search

iris_knn_grid <- 
  grid_regular(neighbors(range = c(3, 9)),
               weight_func(),
               dist_power(),
               levels = c(22, 2, 2))

# workflow creation

iris_wflow <- 
  workflow() %>% 
  add_recipe(iris_rec) %>%
  add_model(iris_knn_tune)

# model assessment

iris_knn_fit_tune <- 
  iris_wflow %>% 
  tune_grid(
    resamples = iris_train_cv,
    grid = iris_knn_grid
  )
#> x Fold1: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> x Fold2: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> x Fold3: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> x Fold4: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> x Fold5: preprocessor 1/1:
#>   Error in `check_training_set()`:
#>   ! Not all variables in the recipe are present in the supplied training...
#> Warning: All models failed. Run `show_notes(.Last.tune.result)` for more
#> information.

# cv results

collect_metrics(iris_knn_fit_tune)

#> Error in `estimate_tune_results()`:
#> ! All of the models failed. See the .notes column.

#> Backtrace:
#>     ▆
#>  1. ├─tune::collect_metrics(iris_knn_fit_tune)
#>  2. └─tune:::collect_metrics.tune_results(iris_knn_fit_tune)
#>  3.   └─tune::estimate_tune_results(x)
#>  4.     └─rlang::abort("All of the models failed. See the .notes column.")

I suspect the problem is with the formula I defined in my iris_rec recipe. The formula there is

Species ~ ., data = iris

which means

Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris

However, when I run my models, the predictor variables are PC1 and PC2, so I guess the formula should be

Species ~ ., data = iris_train_t

or

Species ~ PC1 + PC2, data = iris_train_t

How can I tell my model that my variables and dataset have changed? All the other step_* functions I have used in tidymodels have worked, but I am struggling specifically with step_pca.


  • Two things seem to be causing the confusion.

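    First, you don't need to prep() or juice() a recipe before using it in a model or workflow. The tuning and resampling functions do that within each resample. Because you juiced the data first, your folds contain only Species, PC1, and PC2, while the recipe still expects the four original predictors; that mismatch is what check_training_set() is reporting.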

    You can prep() and juice() if you want to process the training set in order to troubleshoot, visualize, or otherwise explore it, but otherwise you don't need to.
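
    As a sketch of that kind of troubleshooting, the snippet below preps the recipe on its own and compares the columns it expects with the columns it produces. It uses bake(new_data = NULL), which returns the processed training set just as juice() does, and all_numeric_predictors() in place of -Species; the object names are hypothetical.

    ```r
    library(tidymodels)

    rec <- recipe(Species ~ ., data = iris) %>%
      step_center(all_numeric_predictors()) %>%
      step_scale(all_numeric_predictors()) %>%
      step_pca(all_numeric_predictors(), threshold = 0.8)

    # Columns the recipe expects to find in any data it is trained on
    expected_cols <- summary(rec)$variable

    # Train the steps on iris, then return the processed training set
    rec_prepped <- prep(rec, training = iris)
    processed <- bake(rec_prepped, new_data = NULL)

    # Expected columns that no longer exist after processing
    missing_cols <- setdiff(expected_cols, names(processed))
    missing_cols
    ```

    Any column listed in missing_cols is one the recipe would fail to find if the processed data were passed back to it as training data.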

    Second, the recipe is basically a replacement for the formula. It knows what the predictors and outcomes are, so there is rarely a need to use an additional formula on top of that.

    (The exception is models that require special formulas.)
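
    A minimal sketch of that point: the recipe records the role of every column, so a workflow can be fitted with no separate formula. The fixed neighbors = 5 is an arbitrary value chosen for illustration, and the kknn package is assumed to be installed.

    ```r
    library(tidymodels)

    rec <- recipe(Species ~ ., data = iris) %>%
      step_center(all_numeric_predictors()) %>%
      step_scale(all_numeric_predictors()) %>%
      step_pca(all_numeric_predictors(), threshold = 0.8)

    # The recipe already stores which column is the outcome and which are predictors
    roles <- summary(rec)
    roles[, c("variable", "role")]

    knn_spec <- nearest_neighbor(neighbors = 5) %>%
      set_engine("kknn") %>%
      set_mode("classification")

    # No formula here: the workflow takes outcome and predictors from the recipe
    knn_fit <- workflow() %>%
      add_recipe(rec) %>%
      add_model(knn_spec) %>%
      fit(data = iris)
    ```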

    Here is updated code for you:

    library(tidymodels)

    iris_rec <- 
      recipe(Species ~ ., 
             data = iris) %>%
      # center/scale
      step_center(-Species) %>%
      step_scale(-Species) %>%
      # pca
      step_pca(-Species, threshold = 0.8)
    # cross validation on the raw data, not the juiced data
    iris_train_cv <- vfold_cv(iris, v = 5)  #<- changes here
    # tuning
    iris_knn_tune <-
      nearest_neighbor(
        neighbors = tune(),
        weight_func = tune(),
        dist_power = tune()
      ) %>%
      set_engine("kknn") %>%
      set_mode("classification")
    # grid search
    iris_knn_grid <- 
      grid_regular(neighbors(range = c(3, 9)),
                   weight_func(),
                   dist_power(),
                   levels = c(22, 2, 2))
    # workflow creation
    iris_wflow <- 
      workflow() %>% 
      add_recipe(iris_rec) %>%
      add_model(iris_knn_tune)
    # model assessment
    iris_knn_fit_tune <- 
      iris_wflow %>% 
      tune_grid(
        resamples = iris_train_cv,
        grid = iris_knn_grid
      )
    show_best(iris_knn_fit_tune, metric = "roc_auc")
    #> # A tibble: 5 × 9
    #>   neighbors weight_func dist_power .metric .estima…¹  mean     n std_err .config
    #>       <int> <chr>            <dbl> <chr>   <chr>     <dbl> <int>   <dbl> <chr>  
    #> 1         9 rectangular          1 roc_auc hand_till 0.976     5 0.00580 Prepro…
    #> 2         7 triangular           1 roc_auc hand_till 0.975     5 0.00688 Prepro…
    #> 3         9 triangular           2 roc_auc hand_till 0.975     5 0.00571 Prepro…
    #> 4         8 triangular           1 roc_auc hand_till 0.975     5 0.00655 Prepro…
    #> 5         9 triangular           1 roc_auc hand_till 0.975     5 0.00655 Prepro…
    #> # … with abbreviated variable name ¹​.estimator

    Created on 2022-10-13 with reprex v2.0.2
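
    Once a tuning result is available, the chosen parameters can be applied to new observations. The sketch below is one possible follow-up and not part of the original reprex: it tunes only neighbors over a small hand-written grid to stay short, uses an arbitrary seed, and treats head(iris) as a stand-in for a real test set.

    ```r
    library(tidymodels)

    set.seed(2022)  # arbitrary seed so the folds are reproducible

    rec <- recipe(Species ~ ., data = iris) %>%
      step_center(all_numeric_predictors()) %>%
      step_scale(all_numeric_predictors()) %>%
      step_pca(all_numeric_predictors(), threshold = 0.8)

    knn_spec <- nearest_neighbor(neighbors = tune()) %>%
      set_engine("kknn") %>%
      set_mode("classification")

    wflow <- workflow() %>%
      add_recipe(rec) %>%
      add_model(knn_spec)

    # Resample the raw data; the recipe is re-estimated inside each fold
    tuned <- tune_grid(
      wflow,
      resamples = vfold_cv(iris, v = 5),
      grid = tibble(neighbors = c(3, 5, 7, 9))
    )

    # Plug the best setting into the workflow and refit on all of iris
    best <- select_best(tuned, metric = "roc_auc")
    final_fit <- wflow %>%
      finalize_workflow(best) %>%
      fit(data = iris)

    # New observations go in untransformed; the workflow applies the trained recipe
    new_obs <- head(iris)  # stand-in for real test data
    preds <- predict(final_fit, new_data = new_obs)
    ```

    The new data keeps its original columns (Sepal.Length and so on); centering, scaling, and the PCA projection are applied by the fitted workflow at prediction time.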