Search code examples
rfrequency

Nested table, within-column-sub-group totals, frequencies and percentages using r "tables" package


Using R, I want to create a table with nested columns (and possibly nested rows). The cell values should include frequencies and within-sub-group totals and percentages (i.e., cell/[sub-group row total]*100).

I would prefer that the solution uses the tables package. I will be outputting to LaTeX.

Three questions:

1) Please explain why I'm getting the following error: Error in Percent("row") : Summary fn not allowed with Percent

library(tables)
set.seed(123)
df <- data.frame(exposure = sample(LETTERS[1:5], 100, TRUE),
             Group = sample(c("GroupX","GroupY"), 100, TRUE),
             disease = as.integer(sample(c(0,1), 100, TRUE)))

num <- function(x) base::sum(x, na.rm=TRUE)
tabular(Factor(exposure)+1~
          Factor(Group)*
          (Heading()*num*Heading(One)*disease*
             ((Total=1)+Percent("row"))), 
        data=df)

2) How can I create the following ideal table WITH additional columns for within-group percentages after each group*disease frequency. Note that persons without disease are not included in the table.

          Group                        
          GroupX         GroupY        
                 num            num    
 exposure Total  disease Total  disease
 A         9      4      13      6     
 B        12      4       9      5     
 C         9      8       9      6     
 D         7      1       8      3     
 E         9      4      15     12     
 All      46     21      54     32

Here is a start:

tabular(Factor(exposure) + 1 ~ 
          Factor(Group) * 
            ((Total = 1) + num *  disease), data = df)

3) The package uses Percent(). Why would one use a logical vector with Percent(). Can you give an example? Would using a logical vector help me with this problem?

This is similar to this question; however, the offered answer calculates incorrect percentages as evidenced by an example with more than 2 columns.


Solution

  • As of version 0.7.72, the tables package can calculate sub-group percentages. Credit for the commits and this answer goes to Duncan Murdoch, the package maintainer.

    The updated source package is available by SVN from rForge. General installation instructions for installing source packages can be found here. Refer to answer 2 below for usage. Binary packages might be available by the time you read this.

    1) The tables package will only calculate one thing per column. Percent is effectively a "summary function" and by defining a new summary function, num, I've asked it to calculate num in all of the columns, and also to calculate Percent in some of them. If I use my num function, I need to move it within the parentheses so it doesn't end up "multiplied" (in the tables grammar sense) by Percent. The following code will generate a count of persons with disease (i.e., disease==1) and it will generate a row percentage (Group Total/Row Total * 100), not the desired cell/(sub-group row total). With versions of tables < 0.7.72, that's as far as we can get.

    library(tables) ## prior to 0.7.72
    
    df <- data.frame(exposure = sample(LETTERS[1:5], 100, TRUE),
                     Group = sample(c("GroupX","GroupY"), 100, TRUE),
                     disease = as.integer(sample(c(0,1), 100, TRUE)))
    
    num <- function(x) base::sum(x, na.rm=TRUE)
    tabular(Factor(exposure)+1~
              Factor(Group)*
              (Heading("Group Total")*(1)+num*disease+Percent("row")),
            data=df)
    

    2) Version 0.7.72 of the tables package will calculate the desired sub-group percentages. It introduces a pseudo-function called Equal().

    set.seed(100)
    library(tables)
    df <- data.frame(exposure = sample(LETTERS[1:5], 100, TRUE),
                     Group = sample(c("GroupX","GroupY"), 100, TRUE),
                     disease = as.integer(sample(c(0,1), 100, TRUE)))
    
    myTable <- tabular(Factor(exposure)+1~
                         Factor(Group)*
                         (Heading("Group Total")*(1)+Factor(disease)*((n=1)+Heading("%")*Percent(Equal(exposure,Group)))),
                       data=df)
    
    myTable
    

    myTable generates the following output:

              Group                                                                
              GroupX                             GroupY                            
                          disease                            disease               
                          0             1                    0             1       
     exposure Group Total n       %     n  %     Group Total n       %     n  %    
     A         5           1      20.00  4 80.00  6           3      50.00  3 50.00
     B        17          12      70.59  5 29.41 10           3      30.00  7 70.00
     C        13           4      30.77  9 69.23 10           6      60.00  4 40.00
     D         8           2      25.00  6 75.00 13           7      53.85  6 46.15
     E         7           3      42.86  4 57.14 11           8      72.73  3 27.27
     All      50          22      44.00 28 56.00 50          27      54.00 23 46.00
    

    From Duncan's explanation,

    "The general way to read the code above is 'show the percentage of the values in the current cell relative to the values in all cells with equal x and y.'

    The x and y are now taken as expressions; it effectively looks through the formula for places where subsetting occurs, and ignores subsetting by other variables."

    The last step is to subset the table, like a matrix, to keep only the desired columns (and/or rows), as demonstrated by the last example in the help file of tabular():

    myTable[,c(1,4,5,6,9,10)]
    

    This gives the final result:

              Group                                              
              GroupX                    GroupY                   
                          disease                   disease      
                          1                         1            
     exposure Group Total n       %     Group Total n       %    
     A         5           4      80.00  6           3      50.00
     B        17           5      29.41 10           7      70.00
     C        13           9      69.23 10           4      40.00
     D         8           6      75.00 13           6      46.15
     E         7           4      57.14 11           3      27.27
     All      50          28      56.00 50          23      46.00