Search code examples
raggregateodb

Difficulty Grouping and Aggregating Large Dataset


I'm having trouble working with a very large data set. I have an Item ID, Purchase Date, and Purchase Quantity.

 str(Output0)
 'data.frame':  183847 obs. of  3 variables:
  $ D: Factor w/ 460 levels "2015-09-21","2015-09-24",..: 3 3 3 3 3 3 3 3 3 3 ...
  $ P: int  1 2 3 4 5 6 7 8 9 10 ...
  $ Q: num  7 1 2 1 1 1 1 1 1 1 ...

As a note, P=Item ID, D=Date, and Q=Purchase Quantity

I would like to sum the purchase quantity for each individual item by a 3 day period (So there may still be duplicate item IDs). For example:

P    Date      Purchase Q
1234     1/1/16         1
1235     1/1/16         1  
1235     1/2/16         1
1235     1/3/16         1
1444     1/1/16         1
1444     1/2/16         1
1444     1/3/16         1

Would look like:

ItemID    DateEndPoint  Purchase Q
1234       1/1/16         1
1235       1/3/16         3  
1444       1/3/16         3

I've tried using:

Output2 <- aggregate(Output0$Q, by=list(PS=P,
               Date = cut(as.Date(Output0$D, format="%d/%m/%Y"),breaks="3 day")), FUN=sum)

but it is coming up with this error:

Error in seq.int(0, to0 - from, by) : 'to' cannot be NA, NaN or infinite

In addition: Warning messages: 1: In min.default(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, : no non-missing arguments to min; returning Inf 2: In max.default(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, : no non-missing arguments to max; returning -Inf

I'd also like to do the same for other time periods as needed (1 day, 1 week) so something reproducible would be wonderful.

In response to P Lapointe: I tried the below and it looks great, except that the last column is totalling all items across all dates instead of for each period

 Output1 <- POData%>%mutate(Date=as.Date(POData$`PO Date`,"%m-%d-%Y"),Date_Group=cut(Date,breaks="3 days"))%>%  group_by(POData$`ItemID`,Date_Group)%>%summarise(DateEndPoint=max(Date),Purchase_Q=sum(POData$`POQty`,na.rm=TRUE))

It displays as:

 > View(Output1)
 > str(Output1)
 Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    116749 obs. of  4 variables:
  $ POData$`Item ID`: int  11 11 11 11 11 11 11 11 11 11 ...
  $ Date_Group      : Factor w/ 216 levels "2015-09-21","2015-09-24",..: 4 6 11 13 14 15 18 19 24 25 ...
  $ DateEndPoint    : Date, format: "2015-10-02" "2015-10-08" ...
  $ Purchase_Q      : num  2691020 2691020 2691020 2691020 2691020 ...
  - attr(*, "vars")= chr "POData$`Item ID`"
  - attr(*, "drop")= logi TRUE

Thank you in advance!


Solution

  • Here's how to do that with dplyr. Note that I extended your example by one day to show that it can handle additional 3-day groups. Basically, you want to create a new Date_group column to group on. Then, summarise.

    df <- read.table(text="P    Date      Purchase_Q
    1234     1/1/16         1
    1235     1/1/16         1  
    1235     1/2/16         1
    1235     1/3/16         1
    1444     1/1/16         1
    1444     1/2/16         1
    1444     1/3/16         1
    1444     1/5/16         1",header=TRUE,stringsAsFactors=FALSE)
    
    library(dplyr)
    df%>%
      mutate(Date=as.Date(Date,"%m/%d/%y"),Date_group=cut(Date,breaks="3 days")) %>%
      group_by(P,Date_group) %>%
      summarise(DateEndPoint=max(Date),Purchase_Q=sum(Purchase_Q,na.rm=TRUE))
    
          P Date_group DateEndPoint Purchase_Q
      <int>     <fctr>       <date>      <int>
    1  1234 2016-01-01   2016-01-01          1
    2  1235 2016-01-01   2016-01-03          3
    3  1444 2016-01-01   2016-01-03          3
    4  1444 2016-01-04   2016-01-05          1