Search code examples
rtidymodelstidyclust

Tune K for kmeans clustering using Tidymodels, Workflowsets and Recipes


I wish to select the optimal value of K for K means clustering using Tidymodels. I'm exploring using workflowsets to provide a number of preprocessing recipes and I wish to compare their performance when selecting a value for K.

I'm trying to follow along to this tutorial in combination with this one which discusses using workflowsets for comparing models.

I'm using the mtcars data and I've been stuck on the hyper-parameter tuning section where I'm trying to collect the tuning results. See my code below.

I'm stuck at the section where I have a workflowset and am passing it to tune_cluster. I receive the following error:

Error in tune_cluster(): The first argument to [tune_cluster()] should be either a model or workflow.

# INSTALL PACKAGES
pacman::p_load(tidyverse, tidymodels, tidyclust, janitor, ClusterR, knitr, moments, visdat, skimr, DescTools)

mtcars <- mtcars %>%
  mutate(
    `am` = factor(`am`, labels = c(`0` = "auto", `1` = "man")),
    `vs` = factor(`vs`, labels = c(`0` = "V-shaped", `1` = "straight")),
    `cyl` = factor(`cyl`),
    `gear` = factor(`gear`),
    `carb` = factor(`carb`)
  )

# SET UP 10 FOLD CROSS VALIDATION
mtcars_cv <- vfold_cv(mtcars, v = 10)

# SET SEED FOR REPRODUCABILITY
set.seed(123)


# EDA ---------------------------------------------------------------------

#skimr::skim(mtcars)

#DescTools::Desc(mtcars)


# MODEL SPEC --------------------------------------------------------------

kmeans_spec <- k_means(num_clusters = tune())


# PREPROCESSING RECIPES ---------------------------------------------------

rec1 <- recipe(~., data = mtcars) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

rec2 <- recipe(~., data = mtcars) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 2)

rec3 <- recipe(~., data = mtcars) %>%
  step_log(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())


clust_num_grid <- grid_regular(num_clusters(),
  levels = 10
)

# WORKFLOW ----------------------------------------------------------------

wf_set <- workflow_set(
  preproc = list(rec1, rec2, rec3),
  models = list(kmeans_spec)
)


# TUNE HYPER-PARAMETERS ---------------------------------------------------

tune_results <- wf_set %>%
  workflow_map(
    resamples = mtcars_cv,
    grid = clust_num_grid
  ) %>%
  tune_cluster(
    resamples = mtcars_cv,
    grid = clust_num_grid,
    metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
    control = tune::control_grid(save_pred = TRUE, extract = identity)
  )

best_wf <- tune_results %>%
  select_best("sse_ratio")

Any help to overcome this issue would be greatly appreciated.


Solution

  • Thanks for the post! Emil (tidyclust maintainer) and I (workflowsets maintainer) just chatted about this.

    The workflowsets idiom for tuning these models would look like:

    tune_results <-
       wf_set %>% 
       workflow_map(
          "tune_cluster",
          resamples = mtcars_cv,
          grid = clust_num_grid,
          metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
          control = tune::control_grid(save_pred = TRUE, extract = identity)
       )
    

    ...but workflowsets currently prevents you from passing "tune_cluster" as your tuning function. I've filed an issue on the package repository to remind myself to add support for this.

    In the meantime, you could approximate this process with something like:

    tune_cluster_wf <- function(id) {
       tune_cluster(
          extract_workflow(wf_set, id),
          resamples = mtcars_cv,
          grid = clust_num_grid,
          metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
          control = tune::control_grid(save_pred = TRUE, extract = identity)
       )
    }
    
    wf_set$result <- lapply(wf_set$wflow_id, tune_cluster_wf)
    

    The third element throws an error from the recipe, but I'll let you troubleshoot from there. :)