Search code examples
rloopsknitrsummaryxtable

Time Saver on Summary Stats


The goal is NOT cross tabulation. I am aiming for a table with descriptive statistics (in this case, counts & percentages) for all the variables in a dataset so that I do not need to type table(name of variable) approx. 20 times while working on a new dataset. I plan on just exporting this using xtable + knitr. Unfortunately, there is something wrong with the loop. Any help would be appreciated. This question is very relevant to people who repeatedly need to present summary statistics as this would save a lot of time. Sorry for asking this a 3rd time :) I have heard of the table() function - in fact, I've used it several times below In a nutshell, this function presents the names of variables, their levels, the counts & percentages without making someone type in table() for each variable in a dataset-

ESRD <- rep(c("Y", "N"), each=10)
DIABETES <- rep(c("Y", "N", "Y", "N"), c(5, 5, 5, 5))
BLAH <- rep(c("Y", "N"), each=10)
categoricalvariables <- data.frame(ESRD, DIABETES, BLAH)

descriptives <- function(VARIABLEMATRIX){
desc <- matrix(0, ncol=4, nrow=2*ncol(VARIABLEMATRIX) + ncol(VARIABLEMATRIX))
  for (i in 1:ncol(VARIABLEMATRIX)){
    matper <- matrix(0, nrow=dim(table(VARIABLEMATRIX[ ,i])), ncol=1)
    for (i in 1:dim(table(VARIABLEMATRIX[ ,i]))){
      matper[i, ] <- paste(round(prop.table(table(VARIABLEMATRIX[ ,i]))[i]*100, 2), "%")
    }

    matcount <- matrix(0, nrow=dim(table(VARIABLEMATRIX[ ,i])), ncol=1)
    for (i in 1:dim(table(VARIABLEMATRIX[ ,i]))){
      matcount[i, ] <- table(VARIABLEMATRIX[ ,i])[i]
    }

    desc[((3*i)-2), ] <- c(colnames(VARIABLEMATRIX)[i], "", "", "") 
    desc[((3*i)-1):(3*i), ] <- cbind("", names(table(VARIABLEMATRIX[ ,i])), matcount[ ,1], matper[ ,1])
return(desc)
}
}
descriptives(categoricalvariables)

The output I am getting is (clearly there is a bug but I am not sure what is wrong):

     [,1]       [,2] [,3] [,4]  
 [1,] "0"        "0"  "0"  "0"   
 [2,] "0"        "0"  "0"  "0"   
 [3,] "0"        "0"  "0"  "0"   
 [4,] "DIABETES" ""   ""   ""    
 [5,] ""         "N"  "10" "50 %"
 [6,] ""         "Y"  "10" "50 %"
 [7,] "0"        "0"  "0"  "0"   
 [8,] "0"        "0"  "0"  "0"   
 [9,] "0"        "0"  "0"  "0"  

The expected output should be:

     [,1]       [,2] [,3] [,4]  
 [1,] "ESRD"     ""   ""   ""     
 [2,] ""         "N"  "10" "50 %" 
 [3,] ""         "Y"  "10" "50 %"   
 [4,] "DIABETES" ""   ""   ""    
 [5,] ""         "N"  "10" "50 %"
 [6,] ""         "Y"  "10" "50 %"
 [7,] "BLAH"     ""   ""   ""     
 [8,] ""         "N"  "10" "50 %"  
 [9,] ""         "Y"  "10" "50 %"

Solution

  • Here are some options using tidyverse functions:

    library(tidyverse)
    
    categoricalvariables %>% 
      gather(Measure, Value) %>%
      group_by(Measure, Value) %>%
      tally %>%
      mutate(Percent=n/sum(n))
    
       Measure Value     n Percent
    1     BLAH     N    10     0.5
    2     BLAH     Y    10     0.5
    3 DIABETES     N    10     0.5
    4 DIABETES     Y    10     0.5
    5     ESRD     N    10     0.5
    6     ESRD     Y    10     0.5
    
    categoricalvariables %>% 
      gather(Measure, Value) %>%
      group_by(Measure, Value) %>%
      tally %>%
      mutate(Percent=n/sum(n)) %>%
      gather(Stats, Value2, -Measure, -Value) %>%
      unite(Value_Stats, Stats, Value) %>%
      spread(Value_Stats, Value2)
    
       Measure   n_N   n_Y Percent_N Percent_Y
    1     BLAH    10    10       0.5       0.5
    2 DIABETES    10    10       0.5       0.5
    3     ESRD    10    10       0.5       0.5
    

    I've left the data in numeric format in case you want to do further processing.

    To set things up for an exported table, maybe something like this:

    tab = categoricalvariables %>% 
      gather(Measure, Value) %>%
      group_by(Measure, Value) %>%
      summarise(Count=n()) %>%
      mutate(Percent=paste0(sprintf("%1.1f", Count/sum(Count)*100),"%")) %>%
      ungroup %>%
      mutate(Measure = ifelse(duplicated(Measure),"", Measure))
    
       Measure Value Count Percent
    1     BLAH     N    10   50.0%
    2              Y    10   50.0%
    3 DIABETES     N    10   50.0%
    4              Y    10   50.0%
    5     ESRD     N    10   50.0%
    6              Y    10   50.0%
    

    Now you can run xtable on tab.

    library(xtable)
    
    print(xtable(tab, align="llcrr"), include.rownames=FALSE)
    

    Which looks like this when output from an rmarkdown document to PDF:

    enter image description here

    If you have numeric columns that you want to summarise, you can do, for example (using the built-in iris data frame):

    iris %>% group_by(Species) %>%
      summarise_all(funs(mean, min, max)) %>%
      gather(key, value, -Species) %>%
      separate(key, c("Measure","Stat"),"_") %>%
      spread(Stat, value)
    

    You'd probably want to reshape this further or reformat for output to a table, but it gives you an idea of what's possible.

          Species      Measure   max  mean   min
    1      setosa Petal.Length   1.9 1.462   1.0
    2      setosa  Petal.Width   0.6 0.246   0.1
    3      setosa Sepal.Length   5.8 5.006   4.3
    4      setosa  Sepal.Width   4.4 3.428   2.3
    5  versicolor Petal.Length   5.1 4.260   3.0
    6  versicolor  Petal.Width   1.8 1.326   1.0
    7  versicolor Sepal.Length   7.0 5.936   4.9
    8  versicolor  Sepal.Width   3.4 2.770   2.0
    9   virginica Petal.Length   6.9 5.552   4.5
    10  virginica  Petal.Width   2.5 2.026   1.4
    11  virginica Sepal.Length   7.9 6.588   4.9
    12  virginica  Sepal.Width   3.8 2.974   2.2