step_poly error in tidymodels: "'degree' must be less than number of unique points"


I have seen several questions + answers for similar posts in SO (ex. 1, ex. 2, ex. 3), but none seem to really address the problem in the context of tidymodels.

I am trying to use a second-order step_poly function inside a preprocessing recipe to prepare for a KNN model. The sample data is pulled from a Kaggle Playground competition. The training data itself is ~360,000 x 17 with all numeric predictors.

A light preprocessing reprex is:

rec <- recipe(cost ~ ., data = train) |> 
  update_role(id, new_role = 'id') |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(all_predictors()) |> # this line fails??
  step_interact(~ all_predictors():all_predictors())

When going to prep the recipe, prep(rec), an error is thrown:

Error in poly(degree = 2L, x = c(0.871948016751444, 0.871948016751444, : 'degree' must be less than number of unique points

This also persists at tuning time. I understand the rationale behind why the polynomial degree must be less than the number of unique points, but I do not understand where the "unique points" are coming from. Why does my data only have a single unique point? And how can I fix this?

Any and all help is greatly appreciated!


Solution

  • You are correct in seeing that the problem comes from not having enough unique values in the columns you are trying to apply step_poly() to.

    The default value of degree in step_poly() is 2, and poly() requires the degree to be less than the number of unique values, so the step can only be applied to variables with at least 3 unique values.
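
    To see where the message comes from, here is a minimal base R sketch that calls stats::poly() directly, which is the function named in the error above. The vectors two_values and three_values are made up for illustration and are not taken from the Kaggle data. Centering and scaling a column does not change how many unique values it has, so a 0/1 column still has only 2 unique values after step_normalize().

    ```r
    # Base R only: stats::poly() is the function that raises the error.
    # Both vectors are hypothetical toy data.
    two_values   <- c(0, 1, 0, 1, 1)   # like a 0/1 indicator column
    three_values <- c(0, 1, 2, 1, 0)   # three unique values

    # degree = 2 needs at least 3 unique values, so this call fails;
    # tryCatch() captures the error message instead of stopping the script.
    msg <- tryCatch(
      {
        poly(two_values, degree = 2)
        "no error"
      },
      error = function(e) conditionMessage(e)
    )
    print(msg)

    # With 3 unique values the same call succeeds and
    # returns one column per polynomial degree.
    ok <- poly(three_values, degree = 2)
    print(dim(ok))
    ```

    So the "unique points" are counted per column, and any column with only 2 unique values will trigger the error at degree 2.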

    We can use the function n_distinct() inside a map to find the number of distinct values for each variable.

    library(tidymodels)
    
    train <- readr::read_csv("~/Desktop/train.csv.zip")
    train <- janitor::clean_names(train)
    
    train |> 
      select(where(is.numeric)) |>
      map_dbl(n_distinct) |>
      sort()
    #>        recyclable_package                   low_fat                coffee_bar 
    #>                         2                         2                         2 
    #>               video_store                 salad_bar             prepared_food 
    #>                         2                         2                         2 
    #>                   florist avg_cars_at_home_approx_1    unit_sales_in_millions 
    #>                         2                         5                         6 
    #>            total_children      num_children_at_home                store_sqft 
    #>                         6                         6                        20 
    #>            units_per_case                      cost              gross_weight 
    #>                        36                       328                       384 
    #>   store_sales_in_millions                        id 
    #>                      1044                    360336
    

    Many of them have only 2 unique values, so step_poly() cannot be applied to all predictors at once. You will have to specify manually which variables it should be applied to:

    rec <- recipe(cost ~ ., data = train) |> 
      update_role(id, new_role = 'id') |>
      step_normalize(all_numeric_predictors()) |>
      step_poly(cost, gross_weight, store_sales_in_millions) |>
      step_interact(~ all_predictors():all_predictors())
    
    rec |>
      prep()
    #> 
    #> ── Recipe ──────────────────────────────────────────────────────────────────────
    #> 
    #> ── Inputs 
    #> Number of variables by role
    #> outcome:    1
    #> predictor: 15
    #> id:         1
    #> 
    #> ── Training information 
    #> Training data contained 360336 data points and no incomplete rows.
    #> 
    #> ── Operations 
    #> • Centering and scaling for: store_sales_in_millions, ... | Trained
    #> • Orthogonal polynomials on: cost, gross_weight, ... | Trained
    #> • Interactions with: (unit_sales_in_millions + total_children +
    #>   num_children_at_home + avg_cars_at_home_approx_1 + recyclable_package +
    #>   low_fat + units_per_case + store_sqft + coffee_bar + video_store + salad_bar
    #>   + prepared_food + florist + cost_poly_1 + cost_poly_2 + gross_weight_poly_1 +
    #>   gross_weight_poly_2 + store_sales_in_millions_poly_1 +
    #>   store_sales_in_millions_poly_2):(unit_sales_in_millions + total_children +
    #>   num_children_at_home + avg_cars_at_home_approx_1 + recyclable_package +
    #>   low_fat + units_per_case + store_sqft + coffee_bar + video_store + salad_bar
    #>   + prepared_food + florist + cost_poly_1 + cost_poly_2 + gross_weight_poly_1 +
    #>   gross_weight_poly_2 + store_sales_in_millions_poly_1 +
    #>   store_sales_in_millions_poly_2) | Trained
    

    Created on 2023-05-01 with reprex v2.0.2
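
    If you would rather not type the column names by hand, one possible variation is to compute them from the unique-value counts and pass them to step_poly() with all_of(). This is only a sketch under assumptions: toy is a small made-up data frame standing in for train, and the names degree, predictors, n_unique and poly_vars are introduced here for illustration. It restricts the selection to predictor columns.

    ```r
    library(tidymodels)

    # Hypothetical toy data standing in for `train` (not the Kaggle file).
    set.seed(1)
    toy <- data.frame(
      id           = 1:50,
      low_fat      = rep(c(0, 1), 25),      # 2 unique values
      store_sqft   = rep(1:10, 5) * 1000,   # 10 unique values
      gross_weight = runif(50, 5, 20),      # many unique values
      cost         = rnorm(50, 100, 10)     # outcome
    )

    degree <- 2

    # Count unique values per predictor and keep only the columns
    # that have more unique values than the polynomial degree.
    predictors <- setdiff(names(toy), c("id", "cost"))
    n_unique   <- vapply(toy[predictors], function(x) length(unique(x)), integer(1))
    poly_vars  <- names(n_unique)[n_unique > degree]

    rec <- recipe(cost ~ ., data = toy) |>
      update_role(id, new_role = "id") |>
      step_normalize(all_numeric_predictors()) |>
      step_poly(all_of(poly_vars), degree = degree)

    prepped <- prep(rec)
    print(poly_vars)
    ```

    The threshold is tied to degree, so raising the degree automatically drops columns that no longer have enough unique values.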