Search code examples
rdplyrcross-correlation

Tidy approach to assess the agreement among dichotomised/binary parameters


We start with a dataframe of the following form:

my_df <- structure(
  list(
    a = c(1, 1, 1, 2, 2, 2, 3, 3), 
    b = c('M1', 'M2', 'M3', 'M1', 'M2', 'M3', 'M1', 'M3'), 
    c = c(0, 0, 0, 1, 1, 0, 1, 1)
  ), 
  .Names = c("ID", "METHOD", "RESULT"), 
  row.names = c(NA, 8L), 
  class = "data.frame"
)

In this simplified example, we have three methods (M1, M2, M3), three indivdiuals (1,2,3, and for 3, there are only results for M1 and M3 available), and two possible test results, 0 (negative) and 1 (positive). I would like to get an output that looks as follows:

M1 positive M1 negative M2 positive M2 negative M3 positive M3 negative
if M1 positive 100% (XX/XX) 0% (XX/XX) % (XX/XX) % (XX/XX) % (XX/XX) % (XX/XX)
if M1 negative 0% (XX/XX) 100% (XX/XX) % (XX/XX) % (XX/XX) % (XX/XX) % (XX/XX)
if M2 positive % (XX/XX) % (XX/XX) 100% (XX/XX) 0% (XX/XX) % (XX/XX) % (XX/XX)
if M2 negative % (XX/XX) % (XX/XX) 0% (XX/XX) 100% (XX/XX) % (XX/XX) % (XX/XX)
if M3 positive % (XX/XX) % (XX/XX) % (XX/XX) % (XX/XX) 100% (XX/XX) 0% (XX/XX)
if M3 negative % (XX/XX) % (XX/XX) % (XX/XX) % (XX/XX) 0% (XX/XX) 100% (XX/XX)

Or in words, I would like to know how much the result of a method, say M1 for example, coincides with the result of another method. As an output, I would love to get (1) the percentages (say, if M1 positive, then 22% of M3 is also positive) and (2) the absolute numbers (e.g., in 1000 that are positive for M1, 220 were positive for M3). Thus, the (XX/XX) stands for, e.g., number of positives for M1/number of positives for M3, in this particular example.

I have taken several approaches, trying to harness ifelse, if_else, and case_when but I am looking for a generic way, preferentially somewhere situated in the tidyverse, of doing it that will provide the array -- a cross-correlation matrix in a way -- I have in mind. Any help on how to aggregate the data would be appreciated.


Solution

  • pacman::p_load(tidyverse, glue)
    
    name_vals <- c("negative", "positive")
    
    # join the data to itself, so every id, method, result row can see every other method and result with the same id
    full_join(my_df, my_df, by = "ID", relationship = "many-to-many") |>
      count(METHOD.x, METHOD.y, RESULT.x, RESULT.y) |>
    
      # we need to complete the data, because the empty values in our final table have to be created using glue, using values_fiill doesn't work
      complete(METHOD.x, METHOD.y, RESULT.x, RESULT.y, fill = list(n = 0)) |>
    
      # create all the columns we will need for pivoting
      mutate(condition = glue("if {METHOD.x} is {name_vals[RESULT.x + 1]}"),
            name = glue("{METHOD.y} {name_vals[RESULT.y + 1]}"),
            value = glue("{round(n * 100 / max(sum(n), 1), 2)}% ({n}/{sum(n)})"), # max(sum(n), 1) because when a condition doesn't exist in our dataset, (for example, M3 positive in someone who also had M2), sum(n) == 0, dividing by zero gives us NaNs, which you probably don't want
            .by = c(METHOD.x, METHOD.y, RESULT.x))|>
      pivot_wider(id_cols = condition)
    

    Output:

    # A tibble: 6 × 7
      condition         `M1 negative` `M1 positive` `M2 negative` `M2 positive` `M3 negative` `M3 positive`
      <glue>            <glue>        <glue>        <glue>        <glue>        <glue>        <glue>       
    1 if M1 is negative 100% (1/1)    0% (0/1)      100% (1/1)    0% (0/1)      100% (1/1)    0% (0/1)     
    2 if M1 is positive 0% (0/2)      100% (2/2)    0% (0/1)      100% (1/1)    50% (1/2)     50% (1/2)    
    3 if M2 is negative 100% (1/1)    0% (0/1)      100% (1/1)    0% (0/1)      100% (1/1)    0% (0/1)     
    4 if M2 is positive 0% (0/1)      100% (1/1)    0% (0/1)      100% (1/1)    100% (1/1)    0% (0/1)     
    5 if M3 is negative 50% (1/2)     50% (1/2)     50% (1/2)     50% (1/2)     100% (2/2)    0% (0/2)     
    6 if M3 is positive 0% (0/1)      100% (1/1)    0% (0/0)      0% (0/0)      0% (0/1)      100% (1/1)