Search code examples
rdataframedplyrsummarygroup-summaries

Relative frequencies with dplyr with dynamically created columns pertaining to each group


I'm following very useful solution on creating a summary column for multiple categories. As discussed in the linked solution, I am working with a code that generates the percentage column for each subgroup.

Relevant sample code from the linked solution:

mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))

The code generates the desired values:

## Source: local data frame [4 x 4]
## Groups: am
## 
##   am gear  n rel.freq
## 1  0    3 15      79%
## 2  0    4  4      21%
## 3  1    4  8      62%
## 4  1    5  5      38%

Problem

I would like modify this code to dynamically created columns pertaining to the unique categories available in the second category passed in the dplyr call. This would be gear in case of the attached example. So in case of the attached example, the resulting data frame would look like that:

   am gear  n rel.freq_gear3 rel.freq_gear4  rel.freq_gear5
 1  0    3 15      79%            21%
 2  1    4  8      0              62%            38%

Attempts

For a small number of categories I presume that I would be able to make use of the summarising the values in conditionally, as discussed here, where I would try to execute dplyr statements only for specified conditions sumBfoo = sum(B[A=="foo"])). However, this approach would be inefficient when dealing with multiple categories. Outside dplyr solution could be developed with use of a loop and jumping through unique values of the desired category but my desire would be to do this in dplyr.

Sample table

Broadly speaking, I would like to create a table similar to the one below:

 library(gmodels)
 CrossTable(mtcars$am, mtcars$gear)


   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|


Total Observations in Table:  32 


             | mtcars$gear 
   mtcars$am |         3 |         4 |         5 | Row Total | 
-------------|-----------|-----------|-----------|-----------|
           0 |        15 |         4 |         0 |        19 | 
             |     4.169 |     1.371 |     2.969 |           | 
             |     0.789 |     0.211 |     0.000 |     0.594 | 
             |     1.000 |     0.333 |     0.000 |           | 
             |     0.469 |     0.125 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|
           1 |         0 |         8 |         5 |        13 | 
             |     6.094 |     2.003 |     4.339 |           | 
             |     0.000 |     0.615 |     0.385 |     0.406 | 
             |     0.000 |     0.667 |     1.000 |           | 
             |     0.000 |     0.250 |     0.156 |           | 
-------------|-----------|-----------|-----------|-----------|
Column Total |        15 |        12 |         5 |        32 | 
             |     0.469 |     0.375 |     0.156 |           | 
-------------|-----------|-----------|-----------|-----------|

But I'm interested only in row proportions without counts and totals and other gadgets.


Solution

  • dplyr

    Building of the comment by @docendo discimus:

    library(tidyr)
    count(mtcars, am, gear) %>% 
      mutate(rel.freq = n/sum(n)) %>% 
      spread(gear, rel.freq) %>% 
      group_by(am) %>%
      summarize_each(funs(sum2 = sum(., na.rm = TRUE))) %>%
      mutate_each(funs(perc = paste0(round(100 * ., 0), "%")), -am, -n)
    

    Produces:

    Source: local data frame [2 x 5]
    
         am     n     3     4     5
      (dbl) (int) (chr) (chr) (chr)
    1     0    19   79%   21%    0%
    2     1    13    0%   62%   38%
    

    base

    prop.table(table(mtcars$am, mtcars$gear), 1) %>% 
      round(2) %>% 
      '*'(100)
    

    Produces:

       3  4  5
    0 79 21  0
    1  0 62 38