Search code examples
rdplyrtidyrdcast

R reshape2 or dplyr/tidyr: calculating values' percentage of sub-total


Let's use the airquality dataset as a base.

myaqm <- melt(airquality, id=c("Month", "Day"), na.rm = TRUE)

This gives me a simplification of my real dataset. So far, I've done this:

myaqm_dcast <- dcast(myaqm, Day+variable~Month, value.var = "value", sum, margins=c("Day", "variable", "Month"))

Which gives me this:

> head(myaqm_dcast, n=10)
   Day variable     5     6     7     8     9  (all)
1    1    Ozone  41.0   0.0 135.0  39.0  96.0  311.0
2    1  Solar.R 190.0 286.0 269.0  83.0 167.0  995.0
3    1     Wind   7.4   8.6   4.1   6.9   6.9   33.9
4    1     Temp  67.0  78.0  84.0  81.0  91.0  401.0
5    1    (all) 305.4 372.6 492.1 209.9 360.9 1740.9
6    2    Ozone  36.0   0.0  49.0   9.0  78.0  172.0
7    2  Solar.R 118.0 287.0 248.0  24.0 197.0  874.0
8    2     Wind   8.0   9.7   9.2  13.8   5.1   45.8
9    2     Temp  72.0  74.0  85.0  81.0  92.0  404.0
10   2    (all) 234.0 370.7 391.2 127.8 372.1 1495.8

However, I'm trying to create an additional variables'-percentage-of-days'-subtotal column for each current numeric column. So my goal is something like:

Day     variable        5       5(day %)        6       6(day %)        7       7(day %)        8       8(day %)        9       9(day %)        (all)       (all)(day %)
1       Ozone       41      13.4%       0       0.0%        135     27.4%       39      18.6%       96      26.6%       311     17.9%
1       Solar.R     190     62.2%       286     76.8%       269     54.7%       83      39.5%       167     46.3%       995     57.2%
1       Wind        7.4     2.4%        8.6     2.3%        4.1     0.8%        6.9     3.3%        6.9     1.9%        33.9        1.9%
1       Temp        67      21.9%       78      20.9%       84      17.1%       81      38.6%       91      25.2%       401     23.0%
1       (all)       305.4       100.0%      372.6       100.0%      492.1       100.0%      209.9       100.0%      360.9       100.0%      1740.9      100.0%
2       Ozone       36      15.4%       0       0.0%        49      12.5%       9       7.0%        78      21.0%       172     11.5%
2       Solar.R     118     50.4%       287     77.4%       248     63.4%       24      18.8%       197     52.9%       874     58.4%
2       Wind        8       3.4%        9.7     2.6%        9.2     2.4%        13.8        10.8%       5.1     1.4%        45.8        3.1%
2       Temp        72      30.8%       74      20.0%       85      21.7%       81      63.4%       92      24.7%       404     27.0%
2       (all)       234     100.0%      370.7       100.0%      391.2       100.0%      127.8       100.0%      372.1       100.0%      1495.8      100.0%

Sorry for the terrible formatting! But as you can hopefully see, the new additional columns give a percentage of each variable for that day and that month.

I've found another Stack Overflow helper suggest using tidyr and dplyr but I just couldn't adapt their example to my needs. Would someone please show me what to do?


Solution

  • I wrote a percentage function and used that with dplyr. Then I join the columns together.

    pct <- function(x) {x/sum(x)}
    
    df <- myaqm_dcast %>%
      filter(variable != "(all)") %>%
      group_by(Day) %>%
      mutate_each(funs(pct), 3:8) %>%
      inner_join(myaqm_dcast, by = c("Day", "variable"))
    

    Edit: You can modify the percentage function to print however you need (*100, paste the % symbol).

    Edit 2: If you can live without the (all) rows, I've filtered it out. You can always calculate the column sums using the summarise_each() function.