
step_pca() arguments are not being applied


I'm new to tidymodels, but the step_pca() arguments such as num_comp or threshold do not seem to be applied when the recipe is trained. In the example below, I still get 4 components despite setting num_comp = 2.

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
rec <- recipe( ~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2)

prep(rec) %>% tidy(number = 2, type = "coef") %>%
  pivot_wider(names_from = component, values_from = value, id_cols = terms)
#> # A tibble: 4 x 5
#>   terms       PC1    PC2    PC3     PC4
#>   <chr>     <dbl>  <dbl>  <dbl>   <dbl>
#> 1 Murder   -0.536  0.418 -0.341  0.649 
#> 2 Assault  -0.583  0.188 -0.268 -0.743 
#> 3 UrbanPop -0.278 -0.873 -0.378  0.134 
#> 4 Rape     -0.543 -0.167  0.818  0.0890

Solution

  • The full PCA is always computed (so you can still get the variance of every component), and num_comp only specifies how many of the components are retained as predictors. If you want to cap the rank of the decomposition itself, you can pass rank. through options:

    library(recipes)
    #> Loading required package: dplyr
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    #> 
    #> Attaching package: 'recipes'
    #> The following object is masked from 'package:stats':
    #> 
    #>     step
    rec <- recipe( ~ ., data = USArrests) %>%
        step_normalize(all_numeric()) %>%
        step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))
    
    prep(rec) %>% tidy(number = 2, type = "coef")
    #> # A tibble: 8 × 4
    #>   terms     value component id       
    #>   <chr>     <dbl> <chr>     <chr>    
    #> 1 Murder   -0.536 PC1       pca_AoFOm
    #> 2 Assault  -0.583 PC1       pca_AoFOm
    #> 3 UrbanPop -0.278 PC1       pca_AoFOm
    #> 4 Rape     -0.543 PC1       pca_AoFOm
    #> 5 Murder    0.418 PC2       pca_AoFOm
    #> 6 Assault   0.188 PC2       pca_AoFOm
    #> 7 UrbanPop -0.873 PC2       pca_AoFOm
    #> 8 Rape     -0.167 PC2       pca_AoFOm
    

    Created on 2022-01-12 by the reprex package (v2.0.1)

    You could also control this via the tol argument from stats::prcomp(), also passed in as an option.
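As a rough sketch of the tol route mentioned above: prcomp() drops every component whose standard deviation is at most tol times that of the first component. The value tol = 0.5 below is an assumption chosen only for illustration on USArrests, and the object names (pca_full, rec_tol, prepped, coefs) are made up for this example. The final bake() call is there to check which columns actually reach the model, which is what num_comp controls.

```r
library(recipes)

# Standard deviations of all components on the scaled data, relative to PC1.
# prcomp() compares these ratios against tol.
pca_full <- prcomp(USArrests, scale. = TRUE)
round(pca_full$sdev / pca_full$sdev[1], 3)

# tol = 0.5 is an illustrative value, not a recommendation: components whose
# standard deviation is at most half that of PC1 are omitted by prcomp().
rec_tol <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2, options = list(tol = 0.5))

prepped <- prep(rec_tol)

# Components whose loadings were kept after tol was applied
coefs <- tidy(prepped, number = 2, type = "coef")
unique(coefs$component)

# Columns produced for the training data, i.e. the predictors kept by num_comp
names(bake(prepped, new_data = NULL))
```

On this data the ratios are roughly 1, 0.63, 0.38 and 0.26, so a tol of 0.5 should leave only PC1 and PC2 in the loadings. Unlike rank., the right tol depends on the data, so rank. is probably the more predictable choice when a fixed number of components is wanted.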