step_poly error in tidymodels: "'degree' must be less than number of unique points"

I have seen several similar questions and answers on SO (ex. 1, ex. 2, ex. 3), but none seem to address the problem in the context of tidymodels.

I am trying to use a second-order step_poly function inside a preprocessing recipe to prepare for a KNN model. The sample data is pulled from a Kaggle Playground competition. The training data itself is ~360,000 x 17 with all numeric predictors.

A light preprocessing reprex is:

rec <- recipe(cost ~ ., data = train) |> 
  update_role(id, new_role = 'id') |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(all_predictors()) |> # this line fails??
  step_interact(~ all_predictors():all_predictors())

When I prep the recipe with prep(rec), an error is thrown:

Error in poly(degree = 2L, x = c(0.871948016751444, 0.871948016751444, : 'degree' must be less than number of unique points

This also persists at tuning time. I understand the rationale behind why the polynomial degree must be less than the number of unique points, but I do not understand where the "unique points" are coming from. Why does my data only have a single unique point? And how can I fix this?

Any and all help is greatly appreciated!

Solution

You are correct in seeing that the problem comes from not having enough unique values in the columns you are trying to apply step_poly() to.

The default value of degree in step_poly() is 2, and poly() requires the degree to be less than the number of unique values, so it can only be applied to variables with at least 3 unique values.
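The constraint can be reproduced with base R's poly() alone. The following is a minimal sketch on made-up vectors (x_binary and x_three are hypothetical, not columns from the competition data): a two-valued vector triggers the same error message, while a three-valued vector works with degree = 2.

```r
# Hypothetical toy vectors, not taken from the competition data.

# Two unique values: a degree-2 polynomial basis cannot be built,
# so poly() stops with the error quoted in the question.
x_binary <- c(0, 1, 0, 1, 1)
res_binary <- tryCatch(
  poly(x_binary, degree = 2),
  error = function(e) conditionMessage(e)
)
res_binary

# Three unique values: degree = 2 is now allowed and poly()
# returns a matrix with one column per polynomial term.
x_three <- c(0, 1, 2, 1, 0)
res_three <- poly(x_three, degree = 2)
dim(res_three)
```

Since step_normalize() only centers and scales, it does not change how many distinct values a column has, so a binary column stays binary after normalization.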

We can use n_distinct() inside map_dbl() to find the number of distinct values for each numeric variable.

library(tidymodels)

train <- readr::read_csv("~/Desktop/train.csv.zip")
train <- janitor::clean_names(train)

train |> 
  select(where(is.numeric)) |>
  map_dbl(n_distinct) |>
  sort()
#>        recyclable_package                   low_fat                coffee_bar 
#>                         2                         2                         2 
#>               video_store                 salad_bar             prepared_food 
#>                         2                         2                         2 
#>                   florist avg_cars_at_home_approx_1    unit_sales_in_millions 
#>                         2                         5                         6 
#>            total_children      num_children_at_home                store_sqft 
#>                         6                         6                        20 
#>            units_per_case                      cost              gross_weight 
#>                        36                       328                       384 
#>   store_sales_in_millions                        id 
#>                      1044                    360336

We see that many of them have only 2 values, so you will have to manually specify which variables step_poly() is applied to:

rec <- recipe(cost ~ ., data = train) |> 
  update_role(id, new_role = 'id') |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(cost, gross_weight, store_sales_in_millions) |>
  step_interact(~ all_predictors():all_predictors())

rec |>
  prep()
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:    1
#> predictor: 15
#> id:         1
#> 
#> ── Training information 
#> Training data contained 360336 data points and no incomplete rows.
#> 
#> ── Operations 
#> • Centering and scaling for: store_sales_in_millions, ... | Trained
#> • Orthogonal polynomials on: cost, gross_weight, ... | Trained
#> • Interactions with: (unit_sales_in_millions + total_children +
#>   num_children_at_home + avg_cars_at_home_approx_1 + recyclable_package +
#>   low_fat + units_per_case + store_sqft + coffee_bar + video_store + salad_bar
#>   + prepared_food + florist + cost_poly_1 + cost_poly_2 + gross_weight_poly_1 +
#>   gross_weight_poly_2 + store_sales_in_millions_poly_1 +
#>   store_sales_in_millions_poly_2):(unit_sales_in_millions + total_children +
#>   num_children_at_home + avg_cars_at_home_approx_1 + recyclable_package +
#>   low_fat + units_per_case + store_sqft + coffee_bar + video_store + salad_bar
#>   + prepared_food + florist + cost_poly_1 + cost_poly_2 + gross_weight_poly_1 +
#>   gross_weight_poly_2 + store_sales_in_millions_poly_1 +
#>   store_sales_in_millions_poly_2) | Trained
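As an alternative to listing the columns by hand, the eligible columns could be picked programmatically from the distinct-value counts. This is only a sketch under assumptions: toy, poly_cols and rec_toy are hypothetical names, the toy data stands in for train, and the outcome and id columns are deliberately left out of the candidates.

```r
library(tidymodels)

# Hypothetical toy data standing in for `train`.
toy <- tibble(
  id           = 1:40,
  cost         = seq(10, 49.5, length.out = 40),
  gross_weight = seq(1, 20.5, by = 0.5),               # 40 distinct values
  store_sqft   = rep(c(100, 200, 300, 400), times = 10), # 4 distinct values
  low_fat      = rep(c(0, 1), times = 20)              # 2 distinct values
)

degree <- 2

# Count distinct values per numeric predictor (outcome and id excluded)
# and keep the columns with more distinct values than the degree.
counts <- toy |>
  select(-id, -cost) |>
  select(where(is.numeric)) |>
  map_int(n_distinct)

poly_cols <- names(counts)[counts > degree]

rec_toy <- recipe(cost ~ ., data = toy) |>
  update_role(id, new_role = "id") |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(all_of(poly_cols), degree = degree)

# Columns produced by the trained recipe on the toy data.
baked_names <- names(bake(prep(rec_toy), new_data = NULL))
baked_names
```

With this approach the binary columns are left untouched and only columns that can support the requested degree are expanded. If degree is later tuned, the same threshold would need to be recomputed for the largest degree under consideration.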

Created on 2023-05-01 with reprex v2.0.2