I am struggling to summarise my cross-validation results in R. A grid search has been carried out across two hyper-parameters. I am using forward-chaining cross-validation, so I have results across 3 years (2018:2020), and there are 100 sub-models that I would like to summarise results across.
I've replicated the structure of my results here:
library(data.table)
# Set random seed for reproducibility
set.seed(42)
# Define unique sub-models, years, and hyperparameters
sub_models <- paste0("model_", sprintf("%03d", 1:100))
years <- 2018:2020
hyperparam_binary <- c(TRUE, FALSE)
hyperparam_numeric <- c(3, 4, 5)
# Create all combinations using data.table's CJ() (cross join)
dummy_data <- CJ(sub_model = sub_models,
                 year = years,
                 hyperparam_binary = hyperparam_binary,
                 hyperparam_numeric = hyperparam_numeric)
# Generate random performance metrics
dummy_data[, MAE := round(runif(.N, min = 1000, max = 5000), 2)]
dummy_data[, RMSE := round(runif(.N, min = 2000, max = 7000), 2)]
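For reference, the full grid is 100 sub-models x 3 years x 2 x 3 hyper-parameter values = 1800 rows:
nrow(dummy_data)  # 1800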
I would like to summarise these results by calculating the mean and standard deviation of MAE and RMSE for each combination of hyper-parameters. I tried doing this:
hyperparameter_cols <- c("hyperparam_binary", "hyperparam_numeric")
metric_cols <- c("MAE", "RMSE")
summary_results <- dummy_data[, lapply(.SD, function(x) list(mean = mean(x, na.rm = TRUE),
                                                             sd = sd(x, na.rm = TRUE))),
                              by = hyperparameter_cols,
                              .SDcols = metric_cols]
But the resulting summary_results table isn't really what I want: it just seems to append the mean and sd rather than giving me separate columns for them.
I also want the average rank for each parameter combination, where ranks are assigned within each sub-model/year combination. I tried doing this:
ranked_results <- dummy_data[, lapply(.SD, function(x) rank(x, ties.method = "average")),
                             by = c("sub_model", "year"),
                             .SDcols = metric_cols]
which seems to work in providing ranks, but it drops the hyper-parameter columns, so I'm not sure how I can then average the ranks per parameter combination.
I feel like there must be a neater solution in data.table. Any help would be much appreciated!
You can try something like this:
# f() returns, for each column of .SD, a one-row data.table with the mean (m) and sd (s)
f <- \(x) lapply(x, \(i) data.table(m = mean(i), s = sd(i)))
dummy_data[, do.call(cbind, f(.SD)), hyperparameter_cols, .SDcols = metric_cols]
Output:
hyperparam_binary hyperparam_numeric MAE.m MAE.s RMSE.m RMSE.s
<lgcl> <num> <num> <num> <num> <num>
1: FALSE 3 2978.239 1126.644 4563.335 1426.583
2: FALSE 4 2925.884 1199.488 4657.478 1543.213
3: FALSE 5 2895.942 1165.870 4477.693 1405.294
4: TRUE 3 2999.622 1201.470 4550.635 1469.597
5: TRUE 4 2988.928 1120.362 4487.522 1520.403
6: TRUE 5 2918.992 1194.360 4591.984 1469.488
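If you would rather have the summary in long format (one row per hyper-parameter combination and metric), a melt-based variation of the same idea should also work; the mean/sd column names below are just a suggestion:
long <- melt(dummy_data,
             id.vars = c("sub_model", "year", hyperparameter_cols),
             measure.vars = metric_cols,
             variable.name = "metric", value.name = "value")
long[, .(mean = mean(value), sd = sd(value)), by = c(hyperparameter_cols, "metric")]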
For your second question about the ranks, you can assign those columns directly to dummy_data using the := operator. Specifically, do something like this:
rank_cols <- paste0(metric_cols, "rank")
# rank() defaults to ties.method = "average", as in the question
dummy_data[, c(rank_cols) := lapply(.SD, rank),
           by = .(sub_model, year),
           .SDcols = metric_cols]
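If you want to sanity-check the ranks for a single fold before averaging, something like this should do it (model_001 in 2018 is just an arbitrary pick):
dummy_data[sub_model == "model_001" & year == 2018,
           .(hyperparam_binary, hyperparam_numeric, MAE, MAErank)][order(MAErank)]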
You could then include these rank_cols when you call f(), or you could work on them separately. An example of the former is:
dummy_data[, do.call(cbind, f(.SD)), hyperparameter_cols, .SDcols = c(metric_cols, rank_cols)]
Output:
hyperparam_binary hyperparam_numeric MAE.m MAE.s RMSE.m RMSE.s MAErank.m MAErank.s RMSErank.m RMSErank.s
<lgcl> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: FALSE 3 2978.239 1126.644 4563.335 1426.583 3.543333 1.640251 3.460000 1.658585
2: FALSE 4 2925.884 1199.488 4657.478 1543.213 3.440000 1.750222 3.663333 1.796500
3: FALSE 5 2895.942 1165.870 4477.693 1405.294 3.470000 1.743339 3.393333 1.633253
4: TRUE 3 2999.622 1201.470 4550.635 1469.597 3.523333 1.760622 3.530000 1.680828
5: TRUE 4 2988.928 1120.362 4487.522 1520.403 3.543333 1.640251 3.393333 1.797049
6: TRUE 5 2918.992 1194.360 4591.984 1469.488 3.480000 1.722253 3.560000 1.674039
An example of the latter is:
dummy_data[, lapply(.SD, mean), hyperparameter_cols, .SDcols = rank_cols]
Output:
hyperparam_binary hyperparam_numeric MAErank RMSErank
<lgcl> <num> <num> <num>
1: FALSE 3 3.543333 3.460000
2: FALSE 4 3.440000 3.663333
3: FALSE 5 3.470000 3.393333
4: TRUE 3 3.523333 3.530000
5: TRUE 4 3.543333 3.393333
6: TRUE 5 3.480000 3.560000
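If the end goal is to pick a single hyper-parameter combination, you could then order that last table by average rank; sorting on MAErank here is just one possible choice:
avg_ranks <- dummy_data[, lapply(.SD, mean), hyperparameter_cols, .SDcols = rank_cols]
setorder(avg_ranks, MAErank)  # lowest (best) average MAE rank first
avg_ranks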