
Compute sensitivity, specificity, and more using multiple input variables in R


Preamble: this question is a follow-up to this discussion, which received a nice answer. I was also given extremely helpful advice here, and what I am dealing with now goes in a similar direction.

I am creating a largely automated dashboard and am therefore looking for ways to generalise wherever possible. I have a data frame (in long format; most of the work is done with tidyverse packages) with

  • different methods (A, B, C, D, ...) called METHODEKURZ

  • two possible outcome values (0, 1) pertaining to METHODEKURZ, which I call CLASS_INT

  • a set of comorbidities (COM1, COM2, COM3, COM4, COM5), sometimes more, sometimes fewer, called COMORB

  • two possible outcome values (0, 1) pertaining to COMORB, which I call VALUES

Based on this information, I would like to obtain an output that looks like this:

METHODEKURZ  COMORB  Sensitivity  Specificity  PPV   NPV
A            COM1    0.49         0.22         0.31  0.11
B            COM1    0.31         0.22         0.22  0.49
C            COM1    0.22         0.49         0.31  0.22
D            COM1    0.49         0.22         0.31  0.11
A            COM2    0.22         0.22         0.49  0.11
B            COM2    0.49         0.22         0.31  0.22
C            COM2    0.31         0.22         0.31  0.22
D            COM2    0.31         0.22         0.31  0.49

If the question were solely to produce such an output for the variable METHODEKURZ, the approach shown here and reproduced below would be adequate and has been shown to work well:

library(tidyverse)
library(caret)  # provides confusionMatrix()

my_df <- structure(
  list(
    a = c('A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D'), 
    b = c(0,0,1,1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0), 
    c = c('COM1','COM1','COM1','COM1','COM2','COM2','COM2','COM2','COM3','COM3','COM3','COM3', 'COM4','COM4','COM4','COM4','COM5','COM5','COM5','COM5'),
    d = c(1,1,0,0,0,1,0,0,1,0,1,0,0,0,1,1,1,0,1,1) 
  ), 
  .Names = c("METHODEKURZ", "CLASS_INT", "COMORB", "VALUES"), 
  row.names = c(NA, 20L), 
  class = "data.frame") %>%
  # Convert both 0/1 columns to factors, as confusionMatrix() expects
  mutate(across(c(CLASS_INT, VALUES), as.factor))

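# One row of metrics per method; sapply() returns one column per method, so transpose
t(sapply(sort(unique(my_df$METHODEKURZ)), function(i) { 
  
  # Rows of the table are predictions, columns are the reference;
  # caret treats the first factor level ("0") as the positive class
  q <- confusionMatrix(data      = my_df$CLASS_INT[my_df$METHODEKURZ == i],
                       reference = my_df$VALUES[my_df$METHODEKURZ == i])$table
  
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv         = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv         = q[2, 2] / (q[2, 2] + q[2, 1]))
}))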

However, I have COMORB as an additional variable, which I would like to take into account. Could anybody help me modify the code so that it includes COMORB as a second grouping variable? I will use the output as a table but will likely also invest some time in finding a good way to visualise it. Thanks a lot in advance for your help.


Solution

  • Store every combination of the two variables in a data frame using expand.grid, then compute the statistics from the rows of my_df that match each combination.

    library(caret)
    
    # Generate all the combinations of variables using expand.grid
    var_combinations <- expand.grid("METHODEKURZ" = unique(my_df$METHODEKURZ), 
                                    "COMORB" = unique(my_df$COMORB))
    
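    # For each combination, subset the matching rows and derive the metrics.
    # apply() passes each row as a character vector: i[1] is METHODEKURZ, i[2] is COMORB.
    cbind(var_combinations, t(apply(var_combinations, 1, function(i) {
      set_of_rows <- my_df$METHODEKURZ == i[1] & my_df$COMORB == i[2]
      # Rows of the table are predictions, columns are the reference;
      # caret treats the first factor level ("0") as the positive class.
      q <- confusionMatrix(data      = my_df$CLASS_INT[set_of_rows],
                           reference = my_df$VALUES[set_of_rows])$table
      
      c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
        specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
        ppv         = q[1, 1] / (q[1, 1] + q[1, 2]),
        npv         = q[2, 2] / (q[2, 2] + q[2, 1]))
    })))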
    
    #   METHODEKURZ COMORB sensitivity specificity       ppv       npv
    #1            A   COM1   1.0000000   0.6666667 0.6666667 1.0000000
    #2            B   COM1   1.0000000   0.2500000 0.2500000 1.0000000
    #3            C   COM1   0.3333333   0.5000000 0.5000000 0.3333333
    #4            D   COM1   0.0000000   0.3333333 0.0000000 0.3333333
    #5            A   COM2   1.0000000   0.0000000 0.6000000       NaN
    #6            B   COM2   0.0000000   0.5000000 0.0000000 0.6666667
    #7            C   COM2   1.0000000   0.5000000 0.3333333 1.0000000
    #8            D   COM2   0.2500000   0.0000000 0.5000000 0.0000000
    #9            A   COM3   0.5000000   0.0000000 0.2500000 0.0000000
    #10           B   COM3   1.0000000   0.2500000 0.2500000 1.0000000
    #11           C   COM3   0.3333333   0.5000000 0.5000000 0.3333333
    #12           D   COM3   0.5000000   0.0000000 0.6666667 0.0000000
    #13           A   COM4   0.6666667   0.0000000 0.5000000 0.0000000
    #14           B   COM4   1.0000000   0.5000000 0.3333333 1.0000000
    #15           C   COM4   1.0000000   1.0000000 1.0000000 1.0000000
    #16           D   COM4   0.5000000   0.3333333 0.3333333 0.5000000
    #17           A   COM5   0.5000000   1.0000000 1.0000000 0.3333333
    #18           B   COM5   0.0000000   0.7500000 0.0000000 0.7500000
    #19           C   COM5   1.0000000   0.6666667 0.6666667 1.0000000
    #20           D   COM5   0.5000000   0.0000000 0.6666667 0.0000000
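
The NaN in row 5 appears when a denominator is zero: in that group no observation was predicted as class 1, so the second row of the table is empty and npv becomes 0 / 0. If the dashboard should show a missing value instead, something along these lines might help. This is only a sketch: `metrics_from_table` and `safe_ratio` are hypothetical helper names, not caret functions, and the toy vectors are made up for illustration. The table is built with base R's `table()` in the same layout as `confusionMatrix()$table` (predictions in rows, reference in columns).

```r
# Hypothetical helper (not part of caret): compute the four metrics from a
# 2 x 2 table with predictions in rows and the reference in columns, where
# the first level is treated as the positive class.
metrics_from_table <- function(q) {
  # Return NA instead of NaN when the denominator is zero
  safe_ratio <- function(num, den) if (den == 0) NA_real_ else num / den

  c(sensitivity = safe_ratio(q[1, 1], q[1, 1] + q[2, 1]),
    specificity = safe_ratio(q[2, 2], q[2, 2] + q[1, 2]),
    ppv         = safe_ratio(q[1, 1], q[1, 1] + q[1, 2]),
    npv         = safe_ratio(q[2, 2], q[2, 2] + q[2, 1]))
}

# Toy data (made up): every observation is predicted as class "0",
# so the row for predicted class "1" is empty.
pred <- factor(c(0, 0, 0, 0, 0), levels = c(0, 1))
ref  <- factor(c(0, 0, 0, 1, 1), levels = c(0, 1))
q <- table(Prediction = pred, Reference = ref)

m <- metrics_from_table(q)
print(m)
```

Inside the `apply()` call above, the `c(...)` expression could be swapped for `metrics_from_table(q)` if this behaviour is wanted.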
    

    Raw data

    I generated more values so that each combination of variables has several observations (five each). The output above was computed from this data.

    library(dplyr)
    
    #For reproducibility
    set.seed(123)
    
    my_df <- structure(
      list(
        a = rep(c('A','B','C','D'),length.out = 100), 
        b = sample(c(0,1),100, replace = TRUE), 
        c = c(rep('COM1',20),rep('COM2',20),rep('COM3',20),rep('COM4',20), rep('COM5',20)),
        d = sample(c(0,1),100, replace = TRUE)
      ), 
      .Names = c("METHODEKURZ", "CLASS_INT", "COMORB", "VALUES"), 
      row.names = c(NA, 100L), 
      class = "data.frame") %>%
      # Convert both 0/1 columns to factors, as confusionMatrix() expects
      mutate(across(c(CLASS_INT, VALUES), as.factor))
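
Since the question mentions working mostly with the tidyverse, the same four metrics could also be computed without caret by counting the confusion-matrix cells per group with `group_by()` and `summarise()`. The following is a sketch under assumptions, not the method used above: the data are rebuilt with `data.frame()` in the same shape as the raw data, the columns are left numeric rather than converted to factors, and the names `tp`, `fn`, `fp`, `tn` and `result` are introduced here for illustration. It follows the same convention as the caret code, where 0 is treated as the positive class.

```r
library(dplyr)

set.seed(123)  # for reproducibility

# Example data in the same shape as the raw data above (columns kept numeric)
my_df <- data.frame(
  METHODEKURZ = rep(c("A", "B", "C", "D"), length.out = 100),
  CLASS_INT   = sample(c(0, 1), 100, replace = TRUE),
  COMORB      = rep(paste0("COM", 1:5), each = 20),
  VALUES      = sample(c(0, 1), 100, replace = TRUE)
)

result <- my_df %>%
  group_by(COMORB, METHODEKURZ) %>%
  # Count the four cells of the confusion matrix within each group,
  # with CLASS_INT as the prediction, VALUES as the reference,
  # and 0 as the positive class
  summarise(
    tp = sum(CLASS_INT == 0 & VALUES == 0),
    fn = sum(CLASS_INT == 1 & VALUES == 0),
    fp = sum(CLASS_INT == 0 & VALUES == 1),
    tn = sum(CLASS_INT == 1 & VALUES == 1),
    .groups = "drop"
  ) %>%
  # A zero denominator yields NaN, as in the caret-based version
  mutate(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn)
  ) %>%
  select(METHODEKURZ, COMORB, sensitivity, specificity, ppv, npv)

print(result, n = 20)
```

Because the grouping columns are ordinary columns, adding a further variable would only mean adding it to `group_by()` and `select()`, and the result is already in a long-ish shape that ggplot2 can facet on.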