Search code examples
raggregateplyr

cumsum using ddply


I need to use group by in levels with ddply or aggregate if that's easier. I am not really sure how to do this as I need to use cumsum as my aggregate function. This is what my data looks like:

level1      level2  hour     product 
A           tea     0          7
A           tea     1          2
A           tea     2          9
A           coffee  17         7
A           coffee  18         2
A           coffee  20         4
B           coffee  0          2
B           coffee  1          3
B           coffee  2          4
B           tea     21         3
B           tea     22         1

expected output:

A     tea     0   7
A     tea     1   9
A     tea     2   18
A     coffee  17  7
A     coffee  18  9
A     coffee  20  13
B     coffee  0   2
B     coffee  1   5
B     coffee  2   9
B     tea     21  3
B     tea     22  4

I tried using

ddply(dd,c("level1","level2","hour"),summarise,cumsum(product))

but that doesn't sum up which I think is because the hour column is being used for group by and its being split by that..I think.. I am not sure I completely understand how aggregate works here. Is there any way I could get the required output using aggregate or ddply?


Solution

  • Here's a solution in base R using ave and within:

    within(mydf, {
      cumsumProduct <- ave(product, level1, level2, FUN = cumsum)
    })
    #    level1 level2 hour product cumsumProduct
    # 1       A    tea    0       7             7
    # 2       A    tea    1       2             9
    # 3       A    tea    2       9            18
    # 4       A coffee   17       7             7
    # 5       A coffee   18       2             9
    # 6       A coffee   20       4            13
    # 7       B coffee    0       2             2
    # 8       B coffee    1       3             5
    # 9       B coffee    2       4             9
    # 10      B    tea   21       3             3
    # 11      B    tea   22       1             4
    

    Of course, if you wanted to drop the existing product column, you can change the command to the following to overwrite the current "product" column:

    within(mydf, {
      product <- ave(product, level1, level2, FUN = cumsum)
    })
    

    Your current approach doesn't work in part because you've included "hour" as one of your grouping variables. In other words, it is seeing the combination of "A + tea + 0" as different from "A + tea + 1", but from your desired output, you seem to simply want the combination of "A + tea" to be the group.

    aggregate won't work as you expect, because it will condense everything into a data.frame with the same number of rows as the number of unique combinations of "level1" and "level2", in this case, 4 rows. The aggregated column would be a list. The values would be correct, but it would be less useful.

    Here's aggregate and its output:

    > aggregate(product ~ level1 + level2, mydf, cumsum)
      level1 level2  product
    1      A coffee 7, 9, 13
    2      B coffee  2, 5, 9
    3      A    tea 7, 9, 18
    4      B    tea     3, 4