Search code examples
rdata.tablecross-validation

Summarising cross-validation metrics using data.table R


I am struggling to summarise my cross-validation results in R. A grid search has been carried out across two parameters. I am using forward chaining cross-validation so have results across 3 years (2018:2020), and have 100 sub-models that I would like to summarise results across.

I've replicated the structure of my results here:

library(data.table)

# Set random seed for reproducibility
set.seed(42)

# Define unique sub-models, years, and hyperparameters
sub_models <- paste0("model_", sprintf("%03d", 1:100))
years <- 2018:2020
hyperparam_binary <- c(TRUE, FALSE)
hyperparam_numeric <- c(3, 4, 5)

# Create all combinations using data.table's CJ() (cross join)
dummy_data <- CJ(sub_model = sub_models,
                 year = years,
                 hyperparam_binary = hyperparam_binary,
                 hyperparam_numeric = hyperparam_numeric)

# Generate random performance metrics
dummy_data[, MAE := round(runif(.N, min = 1000, max = 5000), 2)]
dummy_data[, RMSE := round(runif(.N, min = 2000, max = 7000), 2)]

I would like to summarise these results by calculating the mean and standard deviation of MAE and RMSE for each combination of hyper-parameters. I tried doing this:

hyperparameter_cols <- c("hyperparam_binary", "hyperparam_numeric")

metric_cols <- c("MAE", "RMSE")

summary_results <- dummy_data[, lapply(.SD, function(x) list(mean = mean(x, na.rm = TRUE), 
                                                             sd = sd(x, na.rm = TRUE))),
                              by = hyperparameter_cols,
                              .SDcols = metric_cols]

But the resulting summary_results table isn't really what I want - it just seems to append mean and sd.

I also want the average rank for each parameter combination, where rank is assigned for each sub-model/year combination. I tried doing this:

ranked_results <- dummy_data[, lapply(.SD, function(x) rank(x, ties.method = "average")), 
                             by = c("sub_model", "year"), 
                             .SDcols = metric_cols]

which seems to work in providing ranks. But it gets rid of the parameter info, so I'm not sure how I can average them.

I feel like there must be a neater solution in data.table. Any help would be much appreciated!


Solution

  • You can try something like this:

    f <- \(x) lapply(x, \(i) data.table(m = mean(i), s= sd(i)))
    
    dummy_data[, do.call(cbind, f(.SD)), hyperparameter_cols,.SDcols = metric_cols]
    

    Output:

       hyperparam_binary hyperparam_numeric    MAE.m    MAE.s   RMSE.m   RMSE.s
                  <lgcl>              <num>    <num>    <num>    <num>    <num>
    1:             FALSE                  3 2978.239 1126.644 4563.335 1426.583
    2:             FALSE                  4 2925.884 1199.488 4657.478 1543.213
    3:             FALSE                  5 2895.942 1165.870 4477.693 1405.294
    4:              TRUE                  3 2999.622 1201.470 4550.635 1469.597
    5:              TRUE                  4 2988.928 1120.362 4487.522 1520.403
    6:              TRUE                  5 2918.992 1194.360 4591.984 1469.488
    

    For your second question about the ranks, you can assign those columns directly to dummy_data using :=. Specifically, do something like this:

    rank_cols = paste0(metric_cols, "rank")
    dummy_data[
      ,
      c(rank_cols):=lapply(.SD, rank),
      .(sub_model, year),
      .SDcols = metric_cols
    ]
    

    You could then include these rank_cols when you call f(), or you could work on them separately. An example of the former is:

    dummy_data[, do.call(cbind, f(.SD)), hyperparameter_cols,.SDcols = c(metric_cols, rank_cols)]
    

    Output:

       hyperparam_binary hyperparam_numeric    MAE.m    MAE.s   RMSE.m   RMSE.s MAErank.m MAErank.s RMSErank.m RMSErank.s
                  <lgcl>              <num>    <num>    <num>    <num>    <num>     <num>     <num>      <num>      <num>
    1:             FALSE                  3 2978.239 1126.644 4563.335 1426.583  3.543333  1.640251   3.460000   1.658585
    2:             FALSE                  4 2925.884 1199.488 4657.478 1543.213  3.440000  1.750222   3.663333   1.796500
    3:             FALSE                  5 2895.942 1165.870 4477.693 1405.294  3.470000  1.743339   3.393333   1.633253
    4:              TRUE                  3 2999.622 1201.470 4550.635 1469.597  3.523333  1.760622   3.530000   1.680828
    5:              TRUE                  4 2988.928 1120.362 4487.522 1520.403  3.543333  1.640251   3.393333   1.797049
    6:              TRUE                  5 2918.992 1194.360 4591.984 1469.488  3.480000  1.722253   3.560000   1.674039
    

    An example of the latter is:

    dummy_data[, lapply(.SD, mean), hyperparameter_cols, .SDcols = rank_cols]
    

    Output:

       hyperparam_binary hyperparam_numeric  MAErank RMSErank
                  <lgcl>              <num>    <num>    <num>
    1:             FALSE                  3 3.543333 3.460000
    2:             FALSE                  4 3.440000 3.663333
    3:             FALSE                  5 3.470000 3.393333
    4:              TRUE                  3 3.523333 3.530000
    5:              TRUE                  4 3.543333 3.393333
    6:              TRUE                  5 3.480000 3.560000