Search code examples
rcategorical-data

Comparing groups when variable has multiple occurences per case


First, I admit, I've asked this elsewhere (Crossvalidated), but I guess it is not necessarily the same people reading both forums.

I'm finishing a manuscript, but have one nagging problem which I'm unable to find an answer.

Study has two groups which receive different procedure for a disease. After the procedure, one patient may face multiple different adverse events (complications). I want to test if there is a difference in total number of adverse events between groups.

Obviously it would be all too easy with a simple single categorical variable, but I'm in doubt if using chi-squared or Fisher's test is the right approach. If I'm not mistaken, their presumptions differ as this is not simple on/off per case-situation.

One idea I got was to create a continuous dummy variable for total number of complications per case and compare those with t-test or something, but it still answers a bit different question imho.

The exact problem without reported answer can be found for example from the CHOCOLATE-trial (https://www.bmj.com/content/363/bmj.k3965.long) table 3, last 3 rows.

Here is a reproducible code of the end result.

# Values are strored in vectors where each level means if a complication occured. Level 0 means there was none
# All variables have similar levels
# If variable compl_1 would have level 0 (there was no complication on that case), 
# variables compl_2 and compl_3 would also have in real dataset.
# This is not true for the following reproducible example,
# but it hopefully still delineates the problem.
# Here we have a sample size of 200 cases, 100 for each group

data <- data.frame(group = as.factor(ifelse(runif(200) > 0.5, 0, 1)),
                   compl_1 = c(rep(0, 140), rep(1, 40), rep(2, 14), rep(3,4), rep(4,2)), 
                   compl_2 = c(rep(0, 182), rep(1, 10), rep(2, 7), rep(3,1), rep(4,0)), 
                   compl_3 = c(rep(0, 187), rep(1, 5), rep(2, 7), rep(3,1), rep(4,0)))

# Then we count totals for each complication type by pivoting long, doing the count and pivoting back
table <- data |> 
  pivot_longer(compl_1:compl_3) |> 
  count(group, value) |> 
  pivot_wider(names_from = group, values_from = n, names_sort = T, values_fill = 0)

# Of course count for level 0 is bonkers as there were 100 cases per group
# So this is fixed by replacing it with the count of 0-levels in compl_1
no_complication <- data |> 
  dplyr::select(group, compl_1) |> 
  count(group, compl_1) |> 
  pivot_wider(names_from = group, values_from = n, names_sort = T, values_fill = 0) |> 
  slice_head(n=1)
table[1,] <- no_complication[1,]

# Next we add a row that contains the total number of complication (occurences of levels 1-4 for compl-variables)
complications_total <- table |> 
  slice(-1) |> 
  mutate(`0` = sum(`0`, na.rm = T),
         `1` = sum(`1`, na.rm = T)) |> 
  slice(1)

#To make end result clearer, let's add some names to cases
complications_total$value <- factor(complications_total$value, labels = "Total complications")
table$value <- factor(table$value, labels = c("No complication",
                                              "Head cut",
                                              "Foot off",
                                              "Hand torn",
                                              "Eyes blown"))

#Then we add the row of totals and give names to operations
table <- table |> 
  add_row(complications_total[1,]) |> 
  arrange(desc(`0`)) |> 
  rename(Henry = `0`,
         Martin = `1`)

So in other words, I want to add a new column with p-values and with the correct statistical test. Comparing group "No complications" is obviously easy, but for others I can't come up with anything.

Edit: After some extra thought, I think I've found a solution. Was probably overthinking this. I'll just take case based mean of each complication and compare those. I guess this is just a variant of what Brian suggested, so I'll flag that as the solution. Thanks for both!


Solution

  • As done in the paper, you could calculate the relative risk of a complication in the groups if you just wish to collapse all complications.

    library(tidyverse)
    library(epitools)
    rr_res <- table %>% # From your final 'table' data above
      # Create contingency table, collapsing all complications into one group
      mutate(value = factor(ifelse(value == "No complication", "No complication", "Complication"),
                            # Order for risk of complication
                            levels = c("No complication", "Complication"))) %>% 
      group_by(value) %>% 
      summarise(Henry = sum(Henry), Martin = sum(Martin)) %>% 
      select(-value) %>% 
      as.matrix() %>% 
      t() %>% # Conditions in rows, outcome in columns
      riskratio.wald()
    rr_res$measure[2,] %>% round(2)
    

    produces risk ratio with 95%CI of being in Martin group and developing complication.

    estimate    lower    upper 
        0.82     0.67     0.99 
    

    You might also produce odds-ratio with test.