I want to use the step_impute_knn function from the recipes package to impute missing values. This function uses Gower distance as its distance metric, which is suitable when the predictors are a mixture of categorical and continuous data. But as far as I can see, there is no way to use this function with tune(), since tuning must be done on a (parsnip) model, and the only relevant parsnip model is nearest_neighbor, which doesn't have Gower distance as an option.
Sample data:
train <- structure(list(PassengerId = c("0001_01", "0002_01", "0003_01",
"0003_02", "0004_01", "0005_01"), HomePlanet = c("Europa", "Earth",
"Europa", "Europa", "Earth", NA), CryoSleep = c("False",
"False", "False", "False", "False", "False"), Cabin = c("B/0/P",
"F/0/S", "A/0/S", "A/0/S", "F/1/S", "F/0/P"), Destination = c("TRAPPIST-1e",
"TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e", "PSO J318.5-22"
), Age = c(39, 24, 58, 33, 16, 44), VIP = c("False", "False",
"True", "False", "False", "False"), RoomService = c(0, 109, 43,
0, 303, 0), FoodCourt = c(0, 9, 3576, 1283, 70, 483), ShoppingMall = c(0,
25, 0, 371, 151, 0), Spa = c(0, 549, 6715, 3329, 565, 291), VRDeck = c(0,
44, 49, 193, 2, 0), Name = c("Maham Ofracculy", "Juanna Vines",
"Altark Susent", "Solam Susent", "Willy Santantines", "Sandie Hinetthews"
), Transported = c("False", "True", "False", "False", "True",
"True")), row.names = c(NA, 6L), class = "data.frame")
What I have so far:
train_no_na <- train %>%
  na.omit()
imp_knn_blueprint <- recipe(Transported ~ ., data = train_no_na) %>%
  step_impute_knn(recipe = ., HomePlanet,
                  impute_with = imp_vars(.), neighbors = 5,
                  options = list(nthread = 1, eps = 1e-08))
imp_knn_prep <- prep(imp_knn_blueprint, training = train_no_na)
imp_knn_5 <- bake(imp_knn_prep, new_data = train)
Is there some way to use the tidymodels and parsnip workflows to tune the kNN function that is used inside step_impute_knn? I've tried reading the code for the function but can't see which engine it uses.
EDIT: To be clear, I'd like to tune the neighbors parameter inside step_impute_knn via a grid search, rather than having to do it manually.
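You can tune() neighbors in step_impute_knn in the same way as other hyperparameters in recipe steps. The recipe is tuned as part of a workflow, so each candidate value of neighbors is scored by the resampled performance of the downstream model.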
library(tidymodels)
train_folds <- vfold_cv(train_no_na, v = 3)
imp_knn_blueprint <- recipe(Transported ~ ., data = train_no_na) %>%
  step_impute_knn(HomePlanet,
                  impute_with = imp_vars(all_predictors()), neighbors = tune::tune(),
                  options = list(nthread = 1, eps = 1e-08))
log_spec <- logistic_reg()
# Update range as appropriate
knn_params <- extract_parameter_set_dials(imp_knn_blueprint) %>%
  update(neighbors = neighbors(c(1L, 10L)))
knn_grid <- grid_regular(knn_params,
                         levels = c(
                           neighbors = 10
                         ))
knn_wf <-
  workflow() %>%
  add_model(log_spec) %>%
  add_recipe(imp_knn_blueprint)
impute_knn_tune <-
  knn_wf %>%
  tune_grid(
    train_folds,
    grid = knn_grid,
    metrics = metric_set(roc_auc, accuracy)
  )
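Once the grid search has run, the usual next step is to pick the best value of neighbors and plug it back into the workflow. A minimal sketch of that follows. The helper name finalize_knn_imputation and its argument names are hypothetical (not part of tidymodels), and it assumes that roc_auc is the metric you want to optimise.

```r
# Hypothetical helper (not part of tidymodels): picks the best `neighbors`
# value from the tuning results and refits the workflow with it.
# Assumes the tune and parsnip packages are installed.
finalize_knn_imputation <- function(tune_results, wf, data, metric = "roc_auc") {
  # Best parameter combination according to the chosen metric
  best_params <- tune::select_best(tune_results, metric = metric)
  # Replace the tune() placeholder in the recipe with the selected value
  final_wf <- tune::finalize_workflow(wf, best_params)
  # Fit the recipe (including the kNN imputation step) and the model
  parsnip::fit(final_wf, data = data)
}

# Usage with the objects defined above:
# final_fit <- finalize_knn_imputation(impute_knn_tune, knn_wf, train_no_na)
```

With only a handful of rows, as in the sample data, the resampled metrics will not be meaningful (some folds may even fail to compute roc_auc), so treat this as something to run on the full data set.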