I have seen several questions and answers in similar posts on SO (ex. 1, ex. 2, ex. 3), but none seem to really address the problem in the context of tidymodels.
I am trying to use a second-order step_poly() function inside a preprocessing recipe to prepare for a KNN model. The sample data is pulled from a Kaggle Playground competition. The training data itself is ~360,000 x 17 with all numeric predictors.
A light preprocessing reprex is:
rec <- recipe(cost ~ ., data = train) |>
  update_role(id, new_role = 'id') |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(all_predictors()) |> # this line fails??
  step_interact(~ all_predictors():all_predictors())
When I go to prep the recipe with prep(rec), an error is thrown:
Error in poly(degree = 2L, x = c(0.871948016751444, 0.871948016751444, : 'degree' must be less than number of unique points
This also persists at tuning time. I understand the rationale behind why the polynomial degree must be less than the number of unique points, but I do not understand where the "unique points" are coming from. Why does my data only have a single unique point? And how can I fix this?
Any and all help is greatly appreciated!
You are correct in seeing that the problem comes from not having enough unique values in the columns you are trying to apply step_poly() to.
The default value of degree in step_poly() is 2, so it can only be applied to variables with at least 3 unique values.
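To see where the "unique points" check comes from, here is a minimal sketch that calls base R's poly() directly on two made-up vectors (x_binary and x_three are illustrative names, not columns from the Kaggle data). It assumes step_poly() ends up calling poly() on each selected column, which is what the error message in the question suggests.

```r
# Two toy vectors: one with only 2 unique values, one with 3.
x_binary <- c(0, 1, 0, 1, 1, 0)
x_three  <- c(0, 1, 2, 0, 1, 2)

# A degree-2 polynomial on a 2-valued vector fails; capture the message
# instead of stopping the script.
msg <- tryCatch(
  {
    poly(x_binary, degree = 2)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With 3 unique values, degree 2 is allowed and returns a 6 x 2 matrix.
p <- poly(x_three, degree = 2)
dim(p)
```

So the data does not have a single unique point; any column with only 2 distinct values (a 0/1 indicator, for example) is enough to trigger the error.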
We can use the function n_distinct() inside a map to find the number of distinct values for each variable.
library(tidymodels)
train <- readr::read_csv("~/Desktop/train.csv.zip")
train <- janitor::clean_names(train)
train |>
  select(where(is.numeric)) |>
  map_dbl(n_distinct) |>
  sort()
#> recyclable_package low_fat coffee_bar
#> 2 2 2
#> video_store salad_bar prepared_food
#> 2 2 2
#> florist avg_cars_at_home_approx_1 unit_sales_in_millions
#> 2 5 6
#> total_children num_children_at_home store_sqft
#> 6 6 20
#> units_per_case cost gross_weight
#> 36 328 384
#> store_sales_in_millions id
#> 1044 360336
We see that a lot of them have just 2 values, so you will have to manually specify which variables step_poly() is applied to:
rec <- recipe(cost ~ ., data = train) |>
  update_role(id, new_role = 'id') |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(cost, gross_weight, store_sales_in_millions) |>
  step_interact(~ all_predictors():all_predictors())

rec |>
  prep()
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 15
#> id: 1
#>
#> ── Training information
#> Training data contained 360336 data points and no incomplete rows.
#>
#> ── Operations
#> • Centering and scaling for: store_sales_in_millions, ... | Trained
#> • Orthogonal polynomials on: cost, gross_weight, ... | Trained
#> • Interactions with: (unit_sales_in_millions + total_children +
#> num_children_at_home + avg_cars_at_home_approx_1 + recyclable_package +
#> low_fat + units_per_case + store_sqft + coffee_bar + video_store + salad_bar
#> + prepared_food + florist + cost_poly_1 + cost_poly_2 + gross_weight_poly_1 +
#> gross_weight_poly_2 + store_sales_in_millions_poly_1 +
#> store_sales_in_millions_poly_2):(unit_sales_in_millions + total_children +
#> num_children_at_home + avg_cars_at_home_approx_1 + recyclable_package +
#> low_fat + units_per_case + store_sqft + coffee_bar + video_store + salad_bar
#> + prepared_food + florist + cost_poly_1 + cost_poly_2 + gross_weight_poly_1 +
#> gross_weight_poly_2 + store_sales_in_millions_poly_1 +
#> store_sales_in_millions_poly_2) | Trained
Created on 2023-05-01 with reprex v2.0.2
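If you would rather not type the column names by hand, one possible approach is to compute them from the distinct-value counts and pass them with all_of(). The sketch below is only an illustration under stated assumptions: toy is a small made-up data frame standing in for train (the values are invented), and poly_vars and degree are hypothetical helper names.

```r
library(recipes)
library(dplyr)

# Toy stand-in for the training data (invented values, not the Kaggle set).
set.seed(1)
toy <- tibble(
  id           = 1:50,
  cost         = rnorm(50, mean = 100, sd = 10),
  gross_weight = runif(50, min = 5, max = 20),
  store_sqft   = rep(c(20000, 30000, 40000, 50000), length.out = 50),
  low_fat      = rep(c(0, 1), times = 25)
)

degree <- 2

# Count distinct values per predictor, leaving out the id and the outcome.
predictors <- setdiff(names(toy), c("id", "cost"))
n_unique   <- vapply(toy[predictors], n_distinct, integer(1))

# Keep only the predictors that have more unique values than the degree.
poly_vars <- names(n_unique)[n_unique > degree]

rec <- recipe(cost ~ ., data = toy) |>
  update_role(id, new_role = "id") |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(all_of(poly_vars), degree = degree)

# bake(new_data = NULL) returns the processed training set.
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

Here low_fat has only 2 distinct values, so it is left out of step_poly() and passes through unchanged, while gross_weight and store_sqft get polynomial terms.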