Easy way to show and compare means of normally vs. non-normally distributed continuous variables in gtsummary


I have a data frame with continuous and factor-type columns. I'm trying to build a summary table with gtsummary, stratified by a variable. My questions are as follows:

  1. Is there a way to test all numerical variables at once to decide if their distribution is normal (shapiro.test() for example) using one of the apply family functions?
  2. After doing so, is there a way to tell gtsummary to show normally distributed data as mean (SD) and non-normally distributed data as median (IQR)?
  3. Can gtsummary decide which method to use for comparing the groups according to the distribution (t-test vs. Mann-Whitney U test, for example)?

Thank you!

FC.


Solution

  • I would try something like this, using the iris dataset as an example. To answer your first question, I would use sapply with shapiro.test to check whether each numeric variable is normally distributed. I used the p-value (below .05 meaning not normal) as the criterion, but you can substitute your own if something else is more appropriate. After that step you have two vectors: one naming the variables that are normally distributed and one naming those that are not. You can then pass the first vector to gtsummary to change the statistics and the test for those variables. You do not need to pass anything for the non-normally distributed variables, because median (IQR) and a rank-based test are already the defaults.

    library(gtsummary)
    library(dplyr)

    # Shapiro-Wilk p-value for every numeric column
    normvals <- sapply(iris[sapply(iris, is.numeric)], function(x) {
      shapiro.test(x)$p.value
    })

    # Split the variable names by the chosen criterion (p < .05 = not normal)
    notnorm <- names(normvals[normvals < .05])
    norm <- names(normvals[normvals >= .05])

    # Keep two species so that a two-group comparison applies
    irisdf <- filter(iris, Species != "setosa") %>%
      mutate(Species = as.character(Species))

    # Mean (SD) and a t-test for the normal variables; defaults for the rest
    tbl_summary(irisdf,
                by = Species,
                statistic = list(all_of(norm) ~ "{mean} ({sd})")) %>%
      add_p(test = list(all_of(norm) ~ "t.test"))
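
One caveat with the step above: shapiro.test is run on each pooled column, and pooled data can look non-normal simply because the groups have different means. A minimal sketch of a per-group check follows, using only base R. The names num_cols, p_by_group and norm_by_group are made up for this illustration, and the rule "normal only if no group rejects at the 5% level" is an assumption, not something gtsummary prescribes.

```r
# Subset to the two species compared in the table and drop the unused level
irisdf <- droplevels(subset(iris, Species != "setosa"))
num_cols <- names(irisdf)[sapply(irisdf, is.numeric)]

# Shapiro-Wilk p-value for each numeric variable within each species
# (rows = species, columns = variables)
p_by_group <- sapply(num_cols, function(v) {
  tapply(irisdf[[v]], irisdf$Species, function(x) shapiro.test(x)$p.value)
})

# Assumed rule: treat a variable as normal only if no group rejects at 5%
norm_by_group <- num_cols[apply(p_by_group >= 0.05, 2, all)]
norm_by_group
```

The resulting character vector could be passed to all_of() in the same way as norm.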
    

    Edit: you can also hard-code the variable names in the gtsummary call, which makes sure it works with the version that is on CRAN as of 9/22/2020:

    tbl_summary(irisdf,
                by = Species,
                statistic = list(c('Sepal.Width', 'Sepal.Length') ~ "{mean} ({sd})")) %>%
      add_p(test = list(c('Sepal.Width', 'Sepal.Length') ~ "t.test"))
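
To sanity-check the p-values in the table, the two tests can be run directly in base R. This is only a sketch: it assumes add_p calls stats::t.test and stats::wilcox.test with their default arguments (so the t-test is the Welch version), which is worth confirming against the gtsummary documentation for your installed version. The choice of Sepal.Width and Petal.Length as example variables is arbitrary.

```r
# Same two-species subset as in the table
irisdf <- subset(iris, Species != "setosa")
irisdf$Species <- as.character(irisdf$Species)

# Welch two-sample t-test for a variable summarised as mean (SD)
t_p <- t.test(Sepal.Width ~ Species, data = irisdf)$p.value

# Wilcoxon rank-sum (Mann-Whitney U) test for a variable summarised as
# median (IQR); exact = FALSE avoids the warning about ties
w_p <- wilcox.test(Petal.Length ~ Species, data = irisdf, exact = FALSE)$p.value

c(t_test = t_p, wilcoxon = w_p)
```

If these values differ from the table, the installed gtsummary version may be using different test arguments than assumed here.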