Tags: r · data.table · dplyr · sqldf · cross-join

Efficient cross join with aggregation and filter


As per the title, I am looking to do a cross join on a table, applying an aggregation function and filtering on a couple of variables within the table.

I have similar data to the following:

library(dplyr)
library(data.table)
library(sqldf)

sales <- data.frame(salesx = c(3000, 2250, 850, 1800, 1700, 560, 58, 200, 965, 1525),
                    week = seq(from = 1, to = 10, by = 1),
                    uplift = c(0.04),
                    slope = c(100),
                    carryover = c(0.35))
spend <- data.frame(spend = seq(from = 1, to = 50000, by = 1))

# no common columns, so merge() returns the full cross join: 50,000 x 10 = 500,000 rows
tempdata <- merge(spend, sales, all = TRUE)
tempdata$singledata <- as.numeric(1)

And here is an example of what I am trying to accomplish via my SQL-based solution:

newdata <- sqldf("select a.spend, a.week,
                 sum(case when b.week > a.week
                 then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
                 else 0.0 end) as calc3
                 from tempdata a, tempdata b  
                 where a.spend = b.spend 
                 group by a.spend,a.week")

This provides the results I want, but it is a little slow, particularly with my real dataset of around 1 million records. It would be great to have some advice on a) how to speed up the sqldf query, or b) how to use a more efficient data.table/dplyr approach (I can't get my head around the cross join/aggregation/filter trifecta problem).

Update: seeking clarity on the non-equi join solution below.

I have a couple of questions about the non-equi join solution – the output is correct and very quick. In trying to understand how the code works, I broke it down like this:

breakdown <- setDT(tempdata)[tempdata,
                             .(spend, uplift, slope, carryover, salesx, singledata,
                               week, i.week, x.week, i.salesx, x.salesx, x.spend, i.spend),
                             on = .(spend, week > week)]

Based on the breakdown, in order to be consistent with the original calculation, it should be:

x.salesx*(uplift*(1.0-exp(-(`^`(singledata,x.week-week)/slope))))/i.spend
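
To double-check the prefix semantics, here is a toy self-join on a made-up three-row table (illustration only, not my data): the unprefixed join column week carries i's values, so x's week has to be requested explicitly as x.week.

toy <- data.table(spend = 1, week = 1:3)
toy[toy, on = .(spend, week > week), .(week, i.week, x.week)]
#    week i.week x.week
# 1:    1      1      2
# 2:    1      1      3
# 3:    2      2      3
# 4:    3      3     NA    # no x row has week > 3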

The reason this wasn't apparent is that, with the example data, the 'power' part of the equation wasn't really doing anything (singledata is always 1, so raising it to any power is still 1). The actual calculation I use is (adding a carryover variable to the data):

SQL:

b.salesx*(b.uplift*(1-exp(-(power((b.singledata*b.carryover),b.week-a.week)/b.slope))))/b.spend

My data.table solution:

sum(salesx.y*(uplift.y*(1-exp(-((singledata.y*carryover.y)^(week.y-week.x)/slope.y))))/spend), by=list(spend, week.x)

However, I am unable to get this working with the non-equi join solution when adding the 'carryover' variable, i.e.

x.salesx*(uplift*(1.0-exp(-(`^`((singledata*carryover),x.week-week)/slope))))/i.spend
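
For context, here is the complete call I am attempting, following the same pattern as the accepted solution below (the name newdata5 is purely illustrative):

newdata5 <- setDT(tempdata)[
  tempdata, on = .(spend, week > week),
  # carryover version of the calc; x.week - week is b.week - a.week,
  # since the unprefixed join column week carries i's (= a's) values
  .(calc3 = sum(x.salesx*(uplift*(1.0-exp(-(`^`((singledata*carryover),x.week-week)/slope))))/i.spend)),
  by = .EACHI][
    is.na(calc3), calc3 := 0.0][]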

Solution

  • With version 1.9.8 (on CRAN 25 Nov 2016), data.table introduced non-equi joins, which help to avoid memory-consuming cross joins:

    library(data.table)
    newdata4 <- 
      # coerce to data.table
      setDT(tempdata)[
        # non-equi self-join
        tempdata, on = .(spend, week > week), 
        # compute result
        .(calc3 = sum(salesx*(uplift*(1.0-exp(-(`^`(singledata,week-i.week)/slope))))/i.spend)), 
        # grouped by join parameters
        by = .EACHI][
          # replace NA
          is.na(calc3), calc3 := 0.0][]
    
    # check that results are equal
    all.equal(newdata, as.data.frame(newdata4[order(spend, week)]))
    
    [1] TRUE
    

    Benchmark

    The OP has provided three different solutions: two sqldf variants and one data.table approach using a cross join. These are compared against the non-equi join.

    The code below

    dt_tempdata <- data.table(tempdata)
    microbenchmark::microbenchmark(
      sqldf = {
        newdata <- sqldf("select a.spend, a.week,
                     sum(case when b.week > a.week
                         then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
                         else 0.0 end) as calc3
                         from tempdata a, tempdata b  
                         where a.spend = b.spend 
                         group by a.spend,a.week")
      },
      sqldf_idx = {
        newdata2 <- sqldf(c('create index newindex on tempdata(spend)',
                            'select a.spend, a.week,
                            sum(case when b.week > a.week
                            then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
                            else 0.0 end) as calc3
                            from main.tempdata a left join main.tempdata b  
                            on a.spend = b.spend 
                            group by a.spend,a.week'), dbname = tempfile())
      },
      dt_merge = { 
        newdata3 <- merge(dt_tempdata, dt_tempdata, by="spend", all=TRUE, allow.cartesian=TRUE)[
          week.y > week.x, 
          .(calc3 = sum(salesx.y*(uplift.y*(1-exp(-(singledata.y^(week.y-week.x)/slope.y)))))), 
          by=.(spend, week.x)]
      },
      dt_nonequi = {
        newdata4 <- dt_tempdata[
          dt_tempdata, on = .(spend, week > week), 
          .(calc3 = sum(salesx*(uplift*(1.0-exp(-(`^`(singledata,week-i.week)/slope))))/i.spend)), 
          by = .EACHI][is.na(calc3), calc3 := 0.0]
      },
      times = 3L
    )
    

    returns these timings:

    Unit: seconds
           expr       min        lq      mean    median        uq       max neval cld
          sqldf  9.456110 10.081704 10.647193 10.707299 11.242735 11.778171     3   b
      sqldf_idx 10.980590 11.477774 11.734239 11.974958 12.111064 12.247170     3   b
       dt_merge  3.037857  3.147274  3.192227  3.256692  3.269412  3.282131     3  a 
     dt_nonequi  1.768764  1.776581  1.792359  1.784397  1.804156  1.823916     3  a
    

    For the given problem size, the non-equi join is the fastest: nearly twice as fast as the merge/cross-join data.table approach and about six times faster than the sqldf queries. Interestingly, index creation and/or temp file usage seems to be rather costly on my system.

    Note that I have streamlined OP's data.table solution.

    Finally, all versions except the merge/cross-join (I have refrained from fixing this version) return the same result.

    all.equal(newdata, newdata2) # TRUE
    all.equal(newdata, as.data.frame(newdata3[order(spend, week.x)])) # FALSE (last week missing)
    all.equal(newdata, as.data.frame(newdata4[order(spend, week)])) # TRUE
    

    Larger problem size

    The OP has reported that the merge/cross-join data.table solution runs out of memory for his production data set of 1 M rows. To verify that the non-equi join approach consumes less memory, I tested it with a problem size of 5 M rows (nrow(tempdata)), ten times larger than in the previous benchmark runs. On my PC with 8 GB of memory, the run completed without problems in about 18 seconds.

    Unit: seconds
           expr      min       lq     mean   median       uq      max neval
     dt_nonequi 18.12387 18.12657 18.23454 18.12927 18.28987 18.45047     3
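
    A test case of this size can be built, for example, by extending spend tenfold so that the cross join with the 10-week sales table yields 5 M rows:

    spend_big <- data.frame(spend = seq_len(500000L))    # 10x the original 50,000 values
    tempdata_big <- merge(spend_big, sales, all = TRUE)  # cross join: 500,000 x 10 weeks
    tempdata_big$singledata <- as.numeric(1)
    nrow(tempdata_big)
    # [1] 5000000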