Tags: r, dataframe, aggregate, sparse-matrix

How to efficiently do aggregate on sparse data


I have a large dataset with 1008412 observations, the columns are customer_id (int), visit_date (Date, format: "2010-04-04"), visit_spend (float).

This date function, used for the aggregate grouping, maps the week numbers of interest to the range 13-65:

weekofperiod <- function(dt) {
    as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
}
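As a quick check of the mapping, assuming the period runs from 2010-04-01 through 2011-03-31, the boundary dates land on the two ends of that range:

```r
# weekofperiod repeated here so the snippet runs standalone
weekofperiod <- function(dt) {
    as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
}

weekofperiod("2010-04-01")  # first week of the period: 13
weekofperiod("2011-03-31")  # last week of the period: 65
```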

Each customer_id has a variable number of total visits over a 53-week period. For each customer_id, I want the aggregate spend_per_week, grouped by weekofperiod(). The code below is functionally correct but very slow - any suggestions to make it faster? Also, aggregate() produces sparse output in which weeks without visits are missing, so I initialize spend_per_week to 0 and then manually assign the non-zero results from aggregate() row by row, to make sure the result always has 53 rows. Surely that can be done better?

Sample dataset lines look like:

   customer_id visit_date visit_spend 
72          40 2011-03-15       18.38 
73          40 2011-03-20       23.45  
74          79 2010-04-07      150.87 
75          79 2010-04-17      101.90 
76          79 2010-05-02      111.90 

and here's the code with aggregate call and adjustment for empty weeks:

for (cid in all_tt_cids) {
  print_pnq('Getting statistics for cid', cid)

  # Get row indices of the selected subset, for just this cid's records
  I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")

  # (other code to compute other per-cid statistics)

  # spend_per_week (mode;mean;sd)
  # Aggregate spend_per_week, but beware this should be 0 for those weeks with no visits
  spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)) )
  nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
  for (i in 1:nrow(nonzero_spends_per_week)) {
    spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
  }
  colnames(spend_per_week)[2] <- 'spend_per_week'

  # (code to compute and store per-cid statistics on spend_per_week)

}

Solution

  • Your biggest speed-up is going to come from replacing the for loops. I can't quite tell from your example, because you overwrite each customer's result in the loop, but here's one way to do it if you want to keep the info for all customers.

    For testing, first define functions for the original method, and a new method without loops:

    weekofperiod <- function(dt) {
      as.numeric(format(as.Date(dt), "%W")) + 52 * (as.numeric(format(as.Date(dt), "%Y"))-2010)
    }
    
    FastMethod <- function(tt) {  
      tt$week = weekofperiod(tt$visit_date)
      # Sum spend for every customer x week pair in one shot
      spend_per_week.tmp = as.data.frame(tapply(tt$visit_spend, tt[, c("customer_id", "week")], sum))
      # Pre-fill a full 53-week frame with zeros, then drop in the observed weeks
      spend_per_week = data.frame(matrix(0, nrow=nrow(spend_per_week.tmp), ncol=length(13:65)))
      colnames(spend_per_week) = 13:65
      rownames(spend_per_week) = rownames(spend_per_week.tmp)
      spend_per_week[, colnames(spend_per_week.tmp)] = spend_per_week.tmp
      # tapply leaves NA where a customer had no visits in an observed week;
      # the requirement is 0 for empty weeks
      spend_per_week[is.na(spend_per_week)] = 0
      spend_per_week
    }
    
    OrigMethod <- function(tt) {
      all_tt_cids = unique(tt$customer_id)
    
      for (cid in all_tt_cids) {
        # Get row indices of the selected subset, for just this cid's records
        I <- which(tt$customer_id==cid & tt$visit_date<="2011-03-31")
    
        # Aggregate spend_per_week, but beware this should be 0 for those weeks with no visits
        spend_per_week <- data.frame(c(list('weekofperiod'=13:65), list('spendperweek'=0)))
        nonzero_spends_per_week <- aggregate(tt$visit_spend[I], list('weekofperiod'=weekofperiod(tt$visit_date[I])), FUN="sum")
        for (i in 1:nrow(nonzero_spends_per_week)) {
          spend_per_week[spend_per_week$weekofperiod==nonzero_spends_per_week[i,1],2] <- nonzero_spends_per_week[i,2]
        }
        colnames(spend_per_week)[2] <- 'spend_per_week'
      }
      spend_per_week
    }
    

    Now simulate a larger dataset so it's easier to compare:

    n.row  = 10^4
    n.cust = 10^3
    
    customer_id = 1:n.cust
    dates = seq(as.Date('2010-04-01'), as.Date('2011-03-31'), by=1)
    visit_date = sample(dates, n.row, replace=TRUE)
    visit_spend = runif(n.row, 0, 200)
    
    tt = data.frame(customer_id, visit_date, visit_spend)
    

    Finally, compare the two methods:

    > system.time(FastMethod(tt))
       user  system elapsed 
      0.082   0.001   0.083 
    > system.time(OrigMethod(tt))
       user  system elapsed 
      4.505   0.007   4.514 
    

    This is already 50x faster, and I bet you can make it even better with more optimization. Good luck!
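For an even more compact variant (a sketch along the same lines, not part of the timing above): coding the week as a factor with the full set of levels 13:65 lets xtabs() build the dense customer-by-week table in one call, with true zeros (rather than NA) for empty weeks. On toy data:

```r
# Toy data: two customers, three visits, most weeks empty
tt <- data.frame(customer_id = c(40, 40, 79),
                 week        = c(13, 13, 65),
                 visit_spend = c(18.38, 23.45, 150.87))

# Unused factor levels are kept by xtabs, so all 53 weeks appear as columns
tt$week <- factor(tt$week, levels = 13:65)
spend_per_week <- xtabs(visit_spend ~ customer_id + week, data = tt)

dim(spend_per_week)         # 2 customers x 53 weeks
spend_per_week["40", "13"]  # 41.83 (18.38 + 23.45)
spend_per_week["79", "14"]  # 0, not NA, for a week with no visits
```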