r, gtsummary

Consistent significant digits in summary table


I noticed that rounding the percentages of a single variable can produce odd-looking sums in tbl_summary tables when the data contain small categories.

Here is an example:

library(gtsummary)

gtsummary::tbl_cross(data = gtsummary::trial[1:30,], row = trt, col = stage, 
                     percent = "cell")

Created on 2025-01-29 with reprex v2.1.1

output

In the output table it is quite clear that the row and column percentages do not add up to the value reported in "Total". For example, summing all the stage percentages for Drug A gives 53.4%.

In some cases this could lead to sums above 100% (for example, 97.8% would be rounded to 98% while the second category would still report 2.2%).

The issue seems to be fixable using digits = 1, but this also changes the number of decimals shown for the integer counts.
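
To make the rounding effect concrete, here is a minimal base R sketch. The counts are hypothetical (they are not the actual values from `trial[1:30, ]`) and were chosen only so that the rounded cells and the rounded total disagree.

```r
# Hypothetical cell counts for one treatment row across four stages,
# out of 30 patients in the whole table (illustrative, not taken from `trial`).
counts  <- c(T1 = 5, T2 = 5, T3 = 5, T4 = 1)
n_total <- 30

cell_pct  <- 100 * counts / n_total       # cell percentages of the whole table
total_pct <- 100 * sum(counts) / n_total  # row total as a percentage

# Round each cell separately, then add the rounded cells up
sum_rounded_0 <- sum(round(cell_pct, 0))
sum_rounded_1 <- sum(round(cell_pct, 1))

# Compare the sum of the displayed cells with the displayed row total
comparison <- data.frame(
  digits          = c(0, 1),
  sum_of_cells    = c(sum_rounded_0, sum_rounded_1),
  displayed_total = c(round(total_pct, 0), round(total_pct, 1))
)
print(comparison)
```

With these counts, the gap between the summed cells and the displayed total shrinks when one decimal is kept, but it does not disappear, so extra decimals reduce the discrepancy rather than guarantee exact sums.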

second output

I'm unable to figure out how to fine-tune this (either with function arguments or with themes). The final objective would be to have the same number of decimal places in all percentage cells so that the sums are accurate: for example, 1 decimal place in this case, or 0 decimal places with the full dataset (where the categories are fairly large).

Any suggestions?


Solution

  • gtsummary::tbl_cross(data = gtsummary::trial[1:30,], row = trt, col = stage, 
                         percent = "cell", digits = c(0, 1))
    




    Based on your comment, it is not obvious what rule or function should be applied to the percentages (it is not that what you want is unclear, but rather that the logic seems paradoxical). Also, I don't particularly agree with using different precision for the same variable. That said, I have included a very laborious way of getting close to what you're describing. Basically, if the rounded total is equal to the sum of the rounded individual percentages, we use the rounded values; otherwise we keep the decimals (although the result is not very consistent, see further down).

    library(gtsummary)
    library(dplyr)
    library(tidyr)
    
    gtbl <- tbl_cross(data = gtsummary::trial[1:30,], row = trt, col = stage, 
                      percent = "cell", digits = c(0, 1))
    
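    # Split each "n (p%)" cell into a count column (_v) and a percent column (_p)
    gtbl$table_body %>% 
      select(label, contains("stat")) %>% 
      separate_wider_regex(contains("stat"), 
                           c(v = ".*?", " \\(", p = ".*", "%\\)"), 
                           names_sep = "_") %>% 
      mutate(across(contains("stat"), ~as.numeric(.x))) %>% 
      # TRUE when the rounded total (stat_0) equals the sum of the rounded cells
      mutate(tst_lgl = round(stat_0_p) == 
               rowSums(round(select(., matches("stat_[1-9]+_p"))))) %>% 
      # Round the percentages only in rows where the check passes;
      # .keep = "unused" drops the helper column tst_lgl afterwards
      mutate(across(contains("_p"), ~ifelse(tst_lgl, round(.x), .x)), 
             .keep = "unused") %>% 
      # Reshape so v and p sit side by side, then rebuild the "n (p%)" strings
      pivot_longer(-label, names_sep = "_(?=[^_]+$)", 
                           names_to = c("col", "name")) %>% 
      pivot_wider(id_cols = c(label, col)) %>% 
      mutate(value = ifelse(is.na(v), NA_character_, paste0(v, " (", p, "%)")), 
             .keep = "unused") %>% 
      pivot_wider(id_cols = label, names_from = col) %>% 
      # Join the rebuilt stat columns back onto the other table_body columns
      right_join({gtbl$table_body %>% select(!contains("stat"))}, ., 
                 by = join_by(label)) -> gtbl$table_body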
    
    gtbl
    

    Looking at the percentages row-wise, we comply with the "rule" described above, but looking at the columns, it quickly falls apart. My advice: just use 1 or 2 decimals consistently. But if you must, you are better off manually editing the table body.
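
The row-versus-column problem can be seen in isolation with a small base R sketch. Everything here is an assumption for illustration: `round_if_consistent()` is a hypothetical helper (not a gtsummary function), the counts are made up, and the rule is applied to the raw percentages rather than to the formatted strings that the pipeline above parses.

```r
# Hypothetical helper (not part of gtsummary): round a set of percentages to
# whole numbers only if the rounded parts still add up to the rounded total.
round_if_consistent <- function(parts, total = sum(parts)) {
  if (round(total) == sum(round(parts))) round(parts) else parts
}

# Hypothetical 2 x 4 table of counts (treatment by stage), 30 patients in all
counts <- matrix(c(5, 5, 5, 1,
                   4, 5, 3, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Drug A", "Drug B"), paste0("T", 1:4)))
pct <- 100 * counts / sum(counts)   # cell percentages of the whole table

# Apply the rule row by row; apply() returns the rows as columns, hence t()
by_row <- t(apply(pct, 1, round_if_consistent))

print(round(by_row, 1))             # one row keeps decimals, the other is rounded
print(round(colSums(by_row), 1))    # column sums of the displayed cells
print(round(colSums(pct), 1))       # true column totals
```

In this made-up table the rule rounds one row and leaves the other with decimals, so each column mixes two precisions and its cells no longer add up to the true column total. That is the inconsistency described above, and it is why a single fixed number of decimals is the simpler choice.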