Tags: r, tidymodels, tidyclust

Collect metrics from a cross-validated tidyclust workflowset


I'm trying to plot the results (K vs. sum-of-squares ratio) for my cross-validated workflow set, but when I use collect_metrics() I don't see a standalone column for K that can be plotted against the mean error.

[Image: collected metrics]

I've parsed out the value for K from the config name but I'm not sure this is a valid approach:

tune_results <- wf_set %>%
  collect_metrics() %>%
  filter(.metric == "sse_ratio")

tune_results %>%
  ggplot(aes(x = as.numeric(stringr::str_sub(.config, -2, -1)), y = mean, color = wflow_id)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ggtitle("Plot of WSS/TSS ratio by Cluster Number") +
  ylab("mean WSS/TSS ratio, over 10 folds") +
  xlab("Number of clusters") +
  scale_x_continuous(breaks = 1:10)

Here's a reprex for the example I'm working through. I've changed the number of CV folds to 3 to speed up the compute time:

if (!requireNamespace("pacman", quietly = TRUE)) {
  message("Installing pacman...")
  install.packages("pacman")
}

# INSTALL (IF MISSING) AND LOAD PACKAGES
pacman::p_load(tidyverse, tidymodels, tidyclust, janitor, ClusterR, knitr, moments, visdat, skimr, DescTools)

mtcars <- mtcars %>%
  mutate(
    `am` = factor(`am`, labels = c(`0` = "auto", `1` = "man")),
    `vs` = factor(`vs`, labels = c(`0` = "V-shaped", `1` = "straight")),
    `cyl` = factor(`cyl`),
    `gear` = factor(`gear`),
    `carb` = factor(`carb`)
  )

# SET SEED FOR REPRODUCIBILITY (must come before the folds are drawn)
set.seed(123)

# SET UP 3-FOLD CROSS VALIDATION (reduced from 10 to speed up the reprex)
mtcars_cv <- vfold_cv(mtcars, v = 3)


# EDA ---------------------------------------------------------------------

# skimr::skim(mtcars)

# DescTools::Desc(mtcars)


# MODEL SPEC --------------------------------------------------------------

kmeans_spec <- k_means(num_clusters = tune())


# PREPROCESSING RECIPES ---------------------------------------------------

rec1 <- recipe(~., data = mtcars) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

rec2 <- recipe(~., data = mtcars) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 2)

rec3 <- recipe(~ ., data = mtcars) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_center(all_numeric())

clust_num_grid <- grid_regular(num_clusters(),
  levels = 10
)

# WORKFLOW ----------------------------------------------------------------

wf_set <- workflow_set(
  preproc = list(rec1, rec2, rec3),
  models = list(kmeans_spec)
)

# TUNE HYPER-PARAMETERS ---------------------------------------------------

tune_cluster_wf <- function(id) {
  tune_cluster(
    extract_workflow(wf_set, id),
    resamples = mtcars_cv,
    grid = clust_num_grid,
    metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
    control = tune::control_grid(save_pred = TRUE, extract = identity)
  )
}

wf_set$result <- map(wf_set$wflow_id, tune_cluster_wf)

tune_results <- wf_set %>%
  collect_metrics() %>%
  filter(.metric == "sse_ratio")

tune_results %>%
  ggplot(aes(x = as.numeric(stringr::str_sub(.config, -2, -1)), y = mean, color = wflow_id)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ggtitle("Plot of WSS/TSS ratio by Cluster Number") +
  ylab("mean WSS/TSS ratio, over 3 folds") +
  xlab("Number of clusters") +
  scale_x_continuous(breaks = 1:10)

Solution

  • The documentation for the workflow_set collect_metrics() method might be helpful here.

    It reads:

    When applied to a workflow set, the metrics and predictions that are returned do not contain the actual tuning parameter columns and values (unlike when these collect functions are run on other objects). The reason is that workflow sets can contain different types of models or models with different tuning parameters.

    If the columns are needed, there are two options. First, the .config column can be used to merge the tuning parameter columns into an appropriate object. Alternatively, the map() function can be used to get the metrics from the original objects (see the example below).

    The example in that documentation shows how to map() over the tuning results and extract k for each one in the process.

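    Note that the output of as.numeric(stringr::str_sub(.config, -2, -1)) is not the value of k for that model. It is the numeric suffix of .config, which is a unique identifier for that model/preprocessor combination. Any agreement with k depends on how the tuning grid happens to be ordered, so it should not be relied on.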
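
    As a minimal sketch of the map() option quoted above, the following collects metrics from each tuning result individually, so the num_clusters column is kept, and then plots against it. It assumes the wf_set object from the question, with its result column already filled in by tune_cluster(). The choice to name the list with wflow_id and stack it with bind_rows() is just one way to do this, not the only one.

    ```r
    library(tidyverse)
    library(tidymodels)
    library(tidyclust)

    # Assumption: `wf_set` exists as built in the question and `wf_set$result`
    # holds one tune_cluster() result per workflow.
    tune_results <- wf_set$result %>%
      set_names(wf_set$wflow_id) %>%      # name each result by its workflow id
      map(collect_metrics) %>%            # per-object call keeps tuning columns
      bind_rows(.id = "wflow_id") %>%     # stack, recording the workflow id
      filter(.metric == "sse_ratio")

    # Plot the metric against the actual tuning parameter rather than .config
    tune_results %>%
      ggplot(aes(x = num_clusters, y = mean, color = wflow_id)) +
      geom_point() +
      geom_line() +
      theme_minimal() +
      ggtitle("Plot of WSS/TSS ratio by Cluster Number") +
      ylab("mean WSS/TSS ratio, over folds") +
      xlab("Number of clusters") +
      scale_x_continuous(breaks = sort(unique(tune_results$num_clusters)))
    ```

    Because collect_metrics() is called on each tuning result rather than on the workflow set, the tuning parameter column is returned alongside .config, and no string parsing is needed.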