Search code examples
rggplot2dplyrcumsum

cumsum per group in dplyr


I am starting to enjoy dplyr but I got stuck on a use case. I want to be able to apply cumsum per group in a dataframe with the package but I can't seem to get it right.

For a demo dataframe I've generated the following data:

set.seed(123)

len = 10 
dates = as.Date('2014-01-01') + 1:len
grp_a = data.frame(dates=dates, group='A', sales=rnorm(len))
grp_b = data.frame(dates=dates, group='B', sales=rnorm(len))
grp_c = data.frame(dates=dates, group='C', sales=rnorm(len))
df = rbind(grp_a, grp_b, grp_c)

This creates a dataframe that looks like:

        dates group       sales
1  2014-01-02     A -0.56047565
2  2014-01-03     A -0.23017749
3  2014-01-04     A  1.55870831
4  2014-01-05     A  0.07050839
5  2014-01-06     A  0.12928774
6  2014-01-02     B  1.71506499
7  2014-01-03     B  0.46091621
8  2014-01-04     B -1.26506123
9  2014-01-05     B -0.68685285
10 2014-01-06     B -0.44566197
11 2014-01-02     C  1.22408180
12 2014-01-03     C  0.35981383
13 2014-01-04     C  0.40077145
14 2014-01-05     C  0.11068272
15 2014-01-06     C -0.55584113

I then go on to create a dataframe for plotting, but with a for loop that I'd like to replace with something cleaner.

pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
for(grp in unique(df$group)){
  subs = filter(df, group == grp) %>% arrange(dates)
  pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=cumsum(subs$sales)))
}

I use this pdf to create a plot.

p = ggplot() 
p = p + geom_line(data=pdf, aes(dates, sales, colour=group))
p + ggtitle("sales per group")

enter image description here

Is there a better way (a way by using the dplyr methods) to create this dataframe? I've looked at the summarize method but this seems to aggregate a group from N items -> 1 item. This use case seems to break my dplyr flow at the moment. Any suggestions to better approach this?


Solution

  • Ah. After fiddling around I seem to have found it.

    pdf = df %>% group_by(group) %>% arrange(dates) %>% mutate(cs = cumsum(sales))
    

    Output with forloop in question:

    > pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
    > for(grp in unique(df$group)){
    +   subs = filter(df, group == grp) %>% arrange(dates)
    +   pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=subs$sales, cs=cumsum(subs$sales)))
    + }
    > pdf
            dates group       sales         cs
    1  2014-01-02     A -0.56047565 -0.5604756
    2  2014-01-03     A -0.23017749 -0.7906531
    3  2014-01-04     A  1.55870831  0.7680552
    4  2014-01-05     A  0.07050839  0.8385636
    5  2014-01-06     A  0.12928774  0.9678513
    6  2014-01-02     B  1.71506499  1.7150650
    7  2014-01-03     B  0.46091621  2.1759812
    8  2014-01-04     B -1.26506123  0.9109200
    9  2014-01-05     B -0.68685285  0.2240671
    10 2014-01-06     B -0.44566197 -0.2215949
    11 2014-01-02     C  1.22408180  1.2240818
    12 2014-01-03     C  0.35981383  1.5838956
    13 2014-01-04     C  0.40077145  1.9846671
    14 2014-01-05     C  0.11068272  2.0953498
    15 2014-01-06     C -0.55584113  1.5395087
    

    Output with this line of code:

    > pdf = df %>% group_by(group) %>% mutate(cs = cumsum(sales))
    > pdf
    Source: local data frame [15 x 4]
    Groups: group
    
            dates group       sales         cs
    1  2014-01-02     A -0.56047565 -0.5604756
    2  2014-01-03     A -0.23017749 -0.7906531
    3  2014-01-04     A  1.55870831  0.7680552
    4  2014-01-05     A  0.07050839  0.8385636
    5  2014-01-06     A  0.12928774  0.9678513
    6  2014-01-02     B  1.71506499  1.7150650
    7  2014-01-03     B  0.46091621  2.1759812
    8  2014-01-04     B -1.26506123  0.9109200
    9  2014-01-05     B -0.68685285  0.2240671
    10 2014-01-06     B -0.44566197 -0.2215949
    11 2014-01-02     C  1.22408180  1.2240818
    12 2014-01-03     C  0.35981383  1.5838956
    13 2014-01-04     C  0.40077145  1.9846671
    14 2014-01-05     C  0.11068272  2.0953498
    15 2014-01-06     C -0.55584113  1.5395087