Search code examples
rdatesumaggregatecbind

Aggregate/Sum data set by week and by product in R


I have a very large data set that I would like to aggregate both by week/month and by product(a few thousand). Is there a way to do so with a data set in the following format?

Date           product product2 product3
03/03/2011       1        0        7
04/08/2011       3        8        2
03/05/2015       6        3       89
03/01/2017       1        0        2
03/03/2017       6        1        6

which would yield the following:

Date           product product2 product3
wk1-032011       1        0        7
wk2-042011       3        8        2
wk1-032015       6        3       89
wk1-032017       7        1        8


df <- structure(list(Date = c("03/03/2011", "04/04/2011", "03/05/2015", "03/01/2017", "03/03/2017"),
             product= c(1L, 3L, 6L, 1L, 6L),
             product2= c(0L, 8L, 3L, 0L, 1L), 
             product3= c(7L, 2L, 89L, 2L, 6L)), 
             .Names= c("Date", "product", "product2", "product3"), 
             class= "data.frame", row.names=c(NA, -5L))

Solution

  • In base R, you can use as.Date to convert your character df$Date into a Date variable and then use format to with the proper formatting to convert the date into a character variable indicating weekly dates. aggregate is then used to perform the aggregation by the new variable.

    aggregate(df[2:4], list("weeks"=format(as.Date(df$Date, "%m/%d/%Y"), "%Y-%W")), FUN=sum)
        weeks product product2 product3
    1 2011-09       1        0        7
    2 2011-14       3        8        2
    3 2015-09       6        3       89
    4 2017-09       7        1        8
    

    See ?strptime for other date conversions.


    As @akrun mentions in the comments, the data.table analog to the above base R code is

    library(data.table)
    setDT(df)[, lapply(.SD, sum),
              by=.(weeks = format(as.IDate(Date, "%m/%d/%Y"), "%Y-%W"))]
    

    Here, setDT converts the data.frame into a data.table, lapply... calculates the sum where .SD stands for the data.table. This sum is calculated by each unique element that is produced from format(as.IDate(Date, "%m/%d/%Y"), "%Y-%W") where the conversion uses data.table's as.IDate in place of the base R as.Date.