Search code examples
rdplyrstatisticsdata.tablestata

R equivalent of `table ,contents( )` Stata command for summary statistics


I am trying to mimic the table Stata command in R, which performs summary statistics tables. The command allows you to create cross tables with diverse statistics inside of the resulting cells. For instance, in my example below, I am crossing three variables (category1, category2, and category3) and getting as a column vector the mean and standard deviation of metric1 and the mean and standard deviation metric2.

The stated behavior is obtained with the following single line on Stata.

table category1 category2 category3 ,c(mean metric1 sd metric1 mean metric2 sd metric2) 

Desired output: Explanation of the table.

Here each column vector of the resulting cross table, lets say X of the cross table contains X = [mean(metric1),sd(metric1), mean(metric2),sd(metric2)]'

----------------------------------------------------------------------------
          |                     category3 and category2                     
          | ------------ First -----------    ----------- Second -----------
category1 |      A        B       C   Total         A       B       C   Total
----------+-----------------------------------------------------------------
        1 |  mean(metric1)  
          |  sd(metric1)  
          |  mean(metric2)  
          |  sd(metric1)   

Desired output (!): Resulting Table on Stata.


----------------------------------------------------------------------------
          |                     category3 and category2                     
          | ------------ First -----------    ----------- Second -----------
category1 |      A       B       C   Total         A       B       C   Total
----------+-----------------------------------------------------------------
        1 |  5.778   7.200   2.571   5.048     6.667   3.000   3.000   4.222
          |  2.906   3.347   2.507   3.324     2.309   1.414   1.155   2.333
          | -1.556  -2.000  -1.143  -1.524    -2.000  -2.000  -3.000  -2.444
          |  1.667   0.000   1.069   1.250     0.000   2.828   1.155   1.333
          | 
        2 |  3.200   6.333   4.200   4.571     4.889   5.000   5.000   4.947
          |  2.280   3.445   2.741   2.976     3.180   3.464   2.449   2.857
          | -0.800  -2.000  -2.000  -1.714    -2.222  -1.500  -1.000  -1.684
          |  1.095   1.265   1.333   1.309     1.563   1.000   1.673   1.529
          | 
        3 |  8.667   4.667   5.167   5.667     5.667   6.667   6.000   6.000
          |  2.309   2.309   2.758   2.849     3.445   4.163   3.464   3.303
          | -3.333  -2.667  -2.000  -2.333    -2.333  -2.000  -1.333  -2.000
          |  1.155   1.155   1.477   1.414     0.816   2.000   1.155   1.206
          | 
    Total |  5.529   6.286   4.207   5.067     5.444   5.111   4.615   5.100
          |  3.125   3.124   2.795   3.047     3.053   3.333   2.501   2.898
          | -1.647  -2.143  -1.793  -1.833    -2.222  -1.778  -1.692  -1.950
          |  1.618   0.949   1.346   1.342     1.166   1.563   1.601   1.395
----------------------------------------------------------------------------

Stata code that generates the above result.

clear all
set obs 100

set seed 777
gen category1 = runiformint(1,3)
gen category2_num = runiformint(1,3)

gen category2 = "A" if category2_num ==1
replace category2 = "B" if category2_num ==2
replace category2 = "C" if category2_num ==3

drop category2_num

gen category3_num = runiformint(1,2)
gen category3 = "First" if category3_num ==1
replace category3 = "Second" if category3_num ==2

drop category3_num

gen metric1 = round(runiform()*10,2)
gen metric2 = round(runiform()*-4,2)

table category1 category2 category3 /// List of the variables that will create the crosstab
         ,c(mean metric1 sd metric1 /// Mean and std.dev of metric1 as 1st and 2nd rows
         mean metric2 sd metric2)   /// Mean and std.dev of metric2 as 3rd and 4th rows
         row col                    /// Add the over all statistics total rows and cols
         format(%9.3f)              // Decimal style setting.

R attempt.

Here is how I have tackled the problem. However, I am still far from my desired results. Even though I have the same information displayed on the screen, the readability is very poor in the way I am presenting it on R. Additionally, I haven't computed the mean and standard deviation for the total of rows and columns and I did on the Stata output.

Finally, in my opinion, this procedure seems like an overkill solution for such a simple problem. In my context packages are allowed, hence, dplyr or data.table suggestions are welcome.

Include Stata generated data + replication routine.

df <- as.data.frame(structure(list(category1 = structure(c(1, 3, 1, 2, 3, 1, 3, 1,1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 1,3, 3, 3, 2, 2, 1, 1, 1, 3, 2, 3, 1, 2, 2, 1, 3, 1, 3, 1, 1, 3,1, 1, 2, 1, 3, 2, 2, 3, 3, 3, 1, 2, 3, 2, 3, 2, 1, 1, 1, 2, 2,2, 1, 3, 2, 2, 2, 3, 3), format.stata = "%9.0g"), 
                    category2 = structure(c("C", "A", "A", "A", "C", "C", "A", "A", "A", "A", "B", "A", "A", "A","A", "B", "A", "C", "C", "B", "C", "A", "A", "C", "A", "B", "C", "B", "C", "C", "A", "C", "B", "B", "A", "B", "C", "A", "B", "B","C", "A", "A", "C", "C", "B", "C", "A", "A", "C", "C", "B", "C", "C", "A", "C", "A", "A", "C", "B", "A", "C", "C", "C", "B", "B","C", "C", "A", "A", "C", "C", "A", "C", "B", "B", "C", "C", "C", "C", "A", "C", "C", "C", "C", "B", "B", "B", "B", "C", "A", "A","C", "C", "A", "A", "A", "B", "B", "C"), format.stata = "%9s"), 
                    category3 = structure(c("First", "Second", "First", "First", "First", "First", "Second", "Second", "First", "Second", "First", "First", "Second", "Second", "First", "Second", "Second", "First", "Second", "First", "First", "First", "First","Second", "First", "First", "Second", "First", "First", "First","First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First","Second", "Second", "First", "Second", "Second", "Second","First", "First", "First", "Second", "Second", "First", "First","Second", "First", "First", "First", "First", "Second", "First","Second", "Second", "First", "Second", "First", "Second", "First", "Second", "First", "First", "First", "First", "Second","First", "First", "First", "Second", "Second", "First", "First","First", "Second", "First", "Second", "First", "Second","Second", "First", "Second", "First", "First", "Second","Second", "Second", "Second", "First"), format.stata = "%9s"),
                    metric1 = structure(c(0, 10, 0, 0, 8, 4, 4, 8, 8, 2, 4, 4, 6, 2, 6, 8, 6, 4, 4, 10, 10, 4, 6, 8, 6, 2, 4, 4, 6, 0, 6,0, 10, 8, 2, 2, 2, 0, 2, 10, 2, 8, 4, 6, 8, 2, 2, 6, 0, 2,4, 6, 2, 2, 8, 6, 8, 8, 2, 8, 10, 4, 4, 4, 4, 10, 4, 2, 6,4, 6, 4, 10, 2, 8, 6, 8, 2, 6, 6, 6, 4, 8, 6, 8, 2, 10, 2, 6, 2, 10, 4, 8, 0, 10, 6, 4, 2, 8, 8), format.stata = "%9.0g"),
                    metric2 = structure(c(0, -4, 0, 0, -2, -2, -2, -2, -4, -2, -2, -2, -2, -4, 0, 0, -2, -2, -4, -2, 0, -2, -4, -2, -2, -2, -2, -2, -4, 0, -4, -4, -2, -2, -2, -2, -2, -2, -4, -2, -2, -2, -2, -2, 0, -2, -4, -4, -2, -2, 0, -4, -2, 0, -2,-2, 0, -2, -4, 0, -2, -2, 0, 0, -4, -4, 0, -2, 0, -2, -2, -4, 0, -2, -2, -2, 0, -2, -2, -2, -2, -2, -2, 0, 0, 0, -2, 0, -2, -4, 0, 0, 0, -2, -4, -4, 0, -2, -2, -4), format.stata = "%9.0g")), 
               row.names = c(NA,-100L), class = c("tbl_df", "tbl", "data.frame")))

# expand grid for every possible value
prs <- expand.grid(cat1 = unique(df$category1)   ,
                   cat2 = unique(df$category2) ,
                   cat3 = unique(df$category3))

#Number of total combinations 
N <-   nrow(prs)
#Loop over the combinations to get the desired statistis
A <- lapply(1:N, FUN = function(i){
      mean1 <- mean(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
      sd1   <- sd(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
        
      mean2 <- mean(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
      sd2   <- sd(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
        
      r_list<- list(cat1 = prs$cat1[i],cat2 = prs$cat2[i], cat3 = prs$cat3[i],
                    mean1 = mean1,  sd1 = sd1 , mean2 = mean2, sd2 = sd2)
  return(r_list)
})

#List to data.frame
df_stats <- do.call(rbind.data.frame, A)

Obtained output (but, not my desired output (!) )

# cat1 cat2   cat3    mean1      sd1     mean2       sd2
# 2     1    C  First 2.571429 2.507133 -1.142857 1.0690450
# 21    3    C  First 5.166667 2.757909 -2.000000 1.4770979
# 3     2    C  First 4.200000 2.740641 -2.000000 1.3333333
# 4     1    A  First 5.777778 2.905933 -1.555556 1.6666667
# 5     3    A  First 8.666667 2.309401 -3.333333 1.1547005
# 6     2    A  First 3.200000 2.280351 -0.800000 1.0954451
# 7     1    B  First 7.200000 3.346640 -2.000000 0.0000000
# 8     3    B  First 4.666667 2.309401 -2.666667 1.1547005
# 9     2    B  First 6.333333 3.444803 -2.000000 1.2649111
# 10    1    C Second 3.000000 1.154701 -3.000000 1.1547005
# 11    3    C Second 6.000000 3.464102 -1.333333 1.1547005
# 12    2    C Second 5.000000 2.449490 -1.000000 1.6733201
# 13    1    A Second 6.666667 2.309401 -2.000000 0.0000000
# 14    3    A Second 5.666667 3.444803 -2.333333 0.8164966
# 15    2    A Second 4.888889 3.179797 -2.222222 1.5634719
# 16    1    B Second 3.000000 1.414214 -2.000000 2.8284271
# 17    3    B Second 6.666667 4.163332 -2.000000 2.0000000
# 18    2    B Second 5.000000 3.464102 -1.500000 1.0000000

Solution

  • You could use data.table and magrittr packages as follows:

    library(magrittr)
    library(data.table)
    
    # function to compute the mean and sd
    fun <- function(x, y) list(metric1_meam=mean(x), metric1_sd=sd(x), metric2_meam=mean(y), metric2_sd=sd(y))
    
    # compute the Total column, and A,B,C columns of the desired output as follows and bind them 
    
    setDT(df)[, 'category1' := as.character(category1)]
    
    Y <- rbind(
      df[, fun(metric1, metric2), by=.(category1, category2, category3)],
      df[, fun(metric1, metric2), by=.(category1, category3)][, category2 := 'Total'],
      df[, fun(metric1, metric2), by=.(category2, category3)][, category1 := 'Total'],
      df[, fun(metric1, metric2), by=.(category3)][, c('category1', 'category2') := 'Total']
    )
    
    # generate the desired output
    melt(Y, measure=patterns('metric')) %>% 
      xtabs(formula = value ~ .) %>% 
      ftable(col.vars = c('category3', 'category2'))
    
    
    
    
    
                           category3      First                                      Second                                 
                           category2          A          B          C      Total          A          B          C      Total
    category1 variable                                                                                                      
    1         metric1_meam            5.7777778  7.2000000  2.5714286  5.0476190  6.6666667  3.0000000  3.0000000  4.2222222
              metric1_sd              2.9059326  3.3466401  2.5071327  3.3237959  2.3094011  1.4142136  1.1547005  2.3333333
              metric2_meam           -1.5555556 -2.0000000 -1.1428571 -1.5238095 -2.0000000 -2.0000000 -3.0000000 -2.4444444
              metric2_sd              1.6666667  0.0000000  1.0690450  1.2497619  0.0000000  2.8284271  1.1547005  1.3333333
    2         metric1_meam            3.2000000  6.3333333  4.2000000  4.5714286  4.8888889  5.0000000  5.0000000  4.9473684
              metric1_sd              2.2803509  3.4448028  2.7406406  2.9760952  3.1797973  3.4641016  2.4494897  2.8572264
              metric2_meam           -0.8000000 -2.0000000 -2.0000000 -1.7142857 -2.2222222 -1.5000000 -1.0000000 -1.6842105
              metric2_sd              1.0954451  1.2649111  1.3333333  1.3093073  1.5634719  1.0000000  1.6733201  1.5294382
    3         metric1_meam            8.6666667  4.6666667  5.1666667  5.6666667  5.6666667  6.6666667  6.0000000  6.0000000
              metric1_sd              2.3094011  2.3094011  2.7579087  2.8491485  3.4448028  4.1633320  3.4641016  3.3028913
              metric2_meam           -3.3333333 -2.6666667 -2.0000000 -2.3333333 -2.3333333 -2.0000000 -1.3333333 -2.0000000
              metric2_sd              1.1547005  1.1547005  1.4770979  1.4142136  0.8164966  2.0000000  1.1547005  1.2060454
    Total     metric1_meam            5.5294118  6.2857143  4.2068966  5.0666667  5.4444444  5.1111111  4.6153846  5.1000000
              metric1_sd              3.1248529  3.1238185  2.7951400  3.0469027  3.0529103  3.3333333  2.5012817  2.8982753
              metric2_meam           -1.6470588 -2.1428571 -1.7931034 -1.8333333 -2.2222222 -1.7777778 -1.6923077 -1.9500000
              metric2_sd              1.6179144  0.9492623  1.3464055  1.3424827  1.1659662  1.5634719  1.6012815  1.3950462