r automation statistics tidyverse data-analysis

Compute sensitivity, specificity, and more using multiple input variables in R

Preamble: this question is a follow-up to this discussion, for which a nice answer was provided. I was also given extremely helpful advice here, and what I am dealing with now goes in a similar direction.

I am creating a largely automated dashboard and am therefore looking for ways to generalise wherever possible. I have a data frame in long format (most of the work is done with tidyverse packages) with:

different methods (A, B, C, D, ...), stored in METHODEKURZ
a binary outcome (0, 1) for each METHODEKURZ, stored in CLASS_INT
a set of comorbidities (COM1, COM2, COM3, COM4, COM5), sometimes more, sometimes fewer, stored in COMORB
a binary outcome (0, 1) for each COMORB, stored in VALUES

Based on this information, I would like to obtain an output that looks like this:

METHODEKURZ	COMORB	Sensitivity	Specificity	PPV	NPV
A	COM1	0.49	0.22	0.31	0.11
B	COM1	0.31	0.22	0.22	0.49
C	COM1	0.22	0.49	0.31	0.22
D	COM1	0.49	0.22	0.31	0.11
A	COM2	0.22	0.22	0.49	0.11
B	COM2	0.49	0.22	0.31	0.22
C	COM2	0.31	0.22	0.31	0.22
D	COM2	0.31	0.22	0.31	0.49

If the goal were only to produce such an output per METHODEKURZ, the approach shown here and reproduced below would be adequate and has been shown to work well:

library(tidyverse)
library(caret)  # provides confusionMatrix()

my_df <- structure(
  list(
    a = c('A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D'), 
    b = c(0,0,1,1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0), 
    c = c('COM1','COM1','COM1','COM1','COM2','COM2','COM2','COM2','COM3','COM3','COM3','COM3', 'COM4','COM4','COM4','COM4','COM5','COM5','COM5','COM5'),
    d = c(1,1,0,0,0,1,0,0,1,0,1,0,0,0,1,1,1,0,1,1) 
  ), 
  .Names = c("METHODEKURZ", "CLASS_INT", "COMORB", "VALUES"), 
  row.names = c(NA, 20L), 
  class = "data.frame") %>%
  # confusionMatrix() expects both outcome columns to be factors
  mutate(across(c(VALUES, CLASS_INT), as.factor))

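# One row of metrics per method
t(sapply(sort(unique(my_df$METHODEKURZ)), function(i) { 
  
  # 2x2 table: rows are predictions (CLASS_INT), columns are the reference (VALUES)
  q <- confusionMatrix(data      = my_df$CLASS_INT[my_df$METHODEKURZ == i],
                       reference = my_df$VALUES[my_df$METHODEKURZ == i])$table
  
  # The first factor level ("0") is treated as the positive class here
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv         = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv         = q[2, 2] / (q[2, 2] + q[2, 1]))
}))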
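
The indexing of q above can be illustrated with base R alone. This is a minimal sketch on made-up vectors (pred and ref are hypothetical, not taken from my_df), assuming the table has the same layout as confusionMatrix(...)$table: predictions in rows, reference in columns, levels in the order "0", "1".

```r
# Hypothetical predictions and reference values, both with levels "0" and "1"
pred <- factor(c(0, 0, 1, 1, 0), levels = c(0, 1))
ref  <- factor(c(0, 1, 1, 0, 0), levels = c(0, 1))

# Rows = predictions, columns = reference
q <- table(Prediction = pred, Reference = ref)

# With "0" as the positive class:
#   q[1, 1] = true positives,  q[1, 2] = false positives
#   q[2, 1] = false negatives, q[2, 2] = true negatives
sensitivity <- q[1, 1] / (q[1, 1] + q[2, 1])
specificity <- q[2, 2] / (q[2, 2] + q[1, 2])

print(q)
```

If "1" should count as the positive class instead, the roles of the two rows and columns swap, so sensitivity and specificity trade places, as do PPV and NPV.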

However, I have COMORB as an additional variable that I would like to take into account. Could anybody help me modify the code to include COMORB as a grouping variable? I will use the output as a table, but will likely also spend some time finding a good way to visualise it. Thanks a lot in advance for your help.

Solution

Generate every combination of METHODEKURZ and COMORB as a data frame with expand.grid, then compute the statistics from the rows that match each combination. The output shown below was produced with the larger data set defined under "Raw data".

library(caret)

# Generate all the combinations of variables using expand.grid
var_combinations <- expand.grid("METHODEKURZ" = unique(my_df$METHODEKURZ), 
                                "COMORB" = unique(my_df$COMORB))

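# apply() passes each row as a character vector: i[1] is the method, i[2] the comorbidity
cbind(var_combinations, t(apply(var_combinations, 1, function(i) {
  # Logical index of the rows that belong to this combination
  set_of_rows <- my_df$METHODEKURZ == i[1] & my_df$COMORB == i[2]
  q <- confusionMatrix(data      = my_df$CLASS_INT[set_of_rows],
                       reference = my_df$VALUES[set_of_rows])$table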
  
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv         = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv         = q[2, 2] / (q[2, 2] + q[2, 1]))
})))

#   METHODEKURZ COMORB sensitivity specificity       ppv       npv
#1            A   COM1   1.0000000   0.6666667 0.6666667 1.0000000
#2            B   COM1   1.0000000   0.2500000 0.2500000 1.0000000
#3            C   COM1   0.3333333   0.5000000 0.5000000 0.3333333
#4            D   COM1   0.0000000   0.3333333 0.0000000 0.3333333
#5            A   COM2   1.0000000   0.0000000 0.6000000       NaN
#6            B   COM2   0.0000000   0.5000000 0.0000000 0.6666667
#7            C   COM2   1.0000000   0.5000000 0.3333333 1.0000000
#8            D   COM2   0.2500000   0.0000000 0.5000000 0.0000000
#9            A   COM3   0.5000000   0.0000000 0.2500000 0.0000000
#10           B   COM3   1.0000000   0.2500000 0.2500000 1.0000000
#11           C   COM3   0.3333333   0.5000000 0.5000000 0.3333333
#12           D   COM3   0.5000000   0.0000000 0.6666667 0.0000000
#13           A   COM4   0.6666667   0.0000000 0.5000000 0.0000000
#14           B   COM4   1.0000000   0.5000000 0.3333333 1.0000000
#15           C   COM4   1.0000000   1.0000000 1.0000000 1.0000000
#16           D   COM4   0.5000000   0.3333333 0.3333333 0.5000000
#17           A   COM5   0.5000000   1.0000000 1.0000000 0.3333333
#18           B   COM5   0.0000000   0.7500000 0.0000000 0.7500000
#19           C   COM5   1.0000000   0.6666667 0.6666667 1.0000000
#20           D   COM5   0.5000000   0.0000000 0.6666667 0.0000000
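
Since the question works mostly with the tidyverse, the same per-combination metrics could also be sketched with a grouped summary instead of expand.grid and apply. This is only one possible alternative, not the method used above. It runs on a small made-up data frame (toy_df is hypothetical) with the same four columns as my_df, and it counts "0" as the positive class so that the formulas match the q[1, 1] indexing used in the solution.

```r
library(dplyr)

# Hypothetical toy data: 2 methods x 2 comorbidities, 4 observations each
toy_df <- data.frame(
  METHODEKURZ = rep(c("A", "B"), each = 8),
  COMORB      = rep(rep(c("COM1", "COM2"), each = 4), times = 2),
  CLASS_INT   = factor(c(0, 0, 1, 1,  0, 1, 1, 1,  0, 0, 0, 1,  1, 1, 0, 0),
                       levels = c(0, 1)),
  VALUES      = factor(c(0, 1, 1, 1,  0, 0, 1, 1,  0, 0, 1, 1,  1, 0, 0, 0),
                       levels = c(0, 1))
)

metrics <- toy_df %>%
  group_by(METHODEKURZ, COMORB) %>%
  summarise(
    # Cell counts of the 2x2 table, with "0" as the positive class
    tp = sum(CLASS_INT == "0" & VALUES == "0"),
    fn = sum(CLASS_INT == "1" & VALUES == "0"),
    fp = sum(CLASS_INT == "0" & VALUES == "1"),
    tn = sum(CLASS_INT == "1" & VALUES == "1"),
    .groups = "drop"
  ) %>%
  mutate(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn)
  )

print(metrics)
```

Replacing toy_df with my_df should give one row per METHODEKURZ/COMORB pair that actually occurs in the data. As in the output above, a ratio whose denominator is zero (for example NPV when a group has no negative predictions) comes out as NaN.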

Raw data

I generated more values to get several observations for each combination of variables. Run this block before the solution code above.

library(dplyr)

#For reproducibility
set.seed(123)

my_df <- structure(
  list(
    a = rep(c('A','B','C','D'),length.out = 100), 
    b = sample(c(0,1),100, replace = TRUE), 
    c = c(rep('COM1',20),rep('COM2',20),rep('COM3',20),rep('COM4',20), rep('COM5',20)),
    d = sample(c(0,1),100, replace = TRUE)
  ), 
  .Names = c("METHODEKURZ", "CLASS_INT", "COMORB", "VALUES"), 
  row.names = c(NA, 100L), 
  class = "data.frame") %>%
  # confusionMatrix() expects both outcome columns to be factors
  mutate(across(c(VALUES, CLASS_INT), as.factor))
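
The reason for generating more values can be checked quickly: in the question's 20-row example, each METHODEKURZ/COMORB pair occurs only once, so a per-combination confusion matrix would rest on a single observation. The sketch below rebuilds just the two grouping columns of that example (small_df is a hypothetical name) and counts the rows per pair.

```r
# Grouping columns of the question's 20-row example:
# methods cycle A, B, C, D; each comorbidity covers four consecutive rows
small_df <- data.frame(
  METHODEKURZ = rep(c("A", "B", "C", "D"), times = 5),
  COMORB      = rep(paste0("COM", 1:5), each = 4)
)

# Number of observations per method/comorbidity pair
counts <- table(small_df$METHODEKURZ, small_df$COMORB)
print(counts)
```

The same table() call on the 100-row my_df gives five observations per pair, which is what makes the ratios in the output above possible.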