Search code examples
rgtsummary

How can I create a cross-table by multiple variables in R?


I have a dummy dataset df which contains data on the treatment, disease grade and survival outcome of 50 people.

library(Hmisc)

# Generate initial cases.

set.seed(123)
n <- 50

df <- data.frame(
  treatment = sample(0:1, n, replace = TRUE),
  grade = sample(0:1, n, replace = TRUE),
  death = sample(0:1, n, replace = TRUE)
)

# Add labels to columns.

label(df$treatment) <- "Drug Intervention"
label(df$grade) <- "Cancer Grade"
label(df$death) <- "Death Occurrence"

# Factor the values.

df$treatment <- factor(df$treatment, levels = c(0, 1), labels = c("Drug A", "Drug B"))
df$grade <- factor(df$grade, levels = c(0, 1), labels = c("Grade 1", "Grade 2"))
df$death <- factor(df$death, levels = c(0, 1), labels = c("Survived", "Died"))

I want to generate a 2 x 3 cross table which shows the number of people who survived by grade and treatment. I am using tbl_strata() and tbl_summary() from gtsummary to do this.

This code is getting close to the desired outcome:

library(tidyverse)
library(gtsummary)

df %>% 
  tbl_strata(
    strata = grade,
    ~.x %>%
      tbl_summary(
        by = death,
        percent = "row"
      ))

Which produces a plot that looks like this:

Characteristic Grade 1 Survived, N = 10 Grade 1 Died, N = 17 Grade 2 Survived, N = 9 Grade 2 Died, N = 14
Treatment
Drug A 5 (28%) 13 (72%) 5 (42%) 7 (58%)
Drug B 5 (56%) 4 (44%) 4 (36%) 7 (64%)

However the desired output is:

Treatment Grade 1 Died Grade 2 Died P-Value
Drug A 13 (72%) 7 (58%) 0.46
Drug B 4 (44%) 7 (64%) 0.65

How can I use gtsummary to filter out/collapse the 'Survived` columns to simplify the table, and is it possible to add a Fisher's exact p-value for the relationship between death and grade (for the two separate drugs)?


Solution

  • There are multiple ways to accomplish this. You could filter out individuals who survived since you're only interested in death occurrences, then use tbl_strata() to stratify by treatment rather than grade. Finally, inside tbl_strata(), you can use tbl_summary() to summarize by grade and calculate percentages and use add_p() for your p-values (or whichever method of p-value calculation you prefer).

    Here is an example of the structure I am thinking of:

    df_filtered <- df %>% 
      filter(death == "Died")
    table_output <- df_filtered %>%
      tbl_strata(
        strata = treatment,
        .f = ~ .x %>%
          tbl_summary(
            by = grade,
            missing = "no"
          ) %>%
          add_n() %>% # Add counts
          modify_header(label = "**Grade**") %>% # Modify the header
          add_p(test = list(all_continuous() ~ "fisher.test")) # Add Fisher's exact test p-value
      )
    print(table_output)
    

    Another option is to use tbl_strata(strata = treatment) to stratify by treatment, and create separate tables for Drug A and Drug B.

    To get your count and percentage format, you can use statistic = list(all_categorical() ~ "{n} ({p}%)").

    And you can specify Fisher's exact explicitly as well: add_p(test = all_categorical() ~ "fisher").

    This way, your code might look something like this:

    df %>%
      tbl_strata(
        strata = treatment,
        ~.x %>%
          tbl_summary(
            by = grade,
            statistic = list(all_categorical() ~ "{n} ({p}%)"),
            missing = "no"
          ) %>%
          add_p(test = all_categorical() ~ "fisher.test")
      )
    

    You can further modify various aspects of formatting like this:

       %>% modify_header(stat_by = "Grade")
       %>% as_gt() %>%  
           tbl_caption(
             caption =  "Captions"
           )