
R: Tabulations and insertions with data.table


I am trying to take a very large set of records with multiple indices, calculate an aggregate statistic on groups determined by a subset of the indices, and then insert that statistic into every row of the table. The issue here is that these are very large tables - over 10M rows each.

Code for reproducing the data is below.

The basic idea is that there are a set of indices, say ix1, ix2, ix3, ..., ixK. Generally, I am choosing only a couple of them, say ix1 and ix2. Then, I calculate an aggregation of all the rows with matching ix1 and ix2 values (over all combinations that appear), for a column called val. To keep it simple, I'll focus on a sum.

I have tried the following methods:

  1. Via sparse matrices: convert the values to a coordinate list, i.e. (ix1, ix2, val), then create a sparseMatrix - this nicely sums everything up, and then I need only convert back from the sparse matrix representation to the coordinate list (a sketch follows this list). Speed: good, but it does more work than necessary, and it doesn't generalize to higher dimensions (e.g. ix1, ix2, ix3) or to functions more general than a sum.

  2. Via lapply and split: by creating a new index that is unique over all (ix1, ix2, ...) n-tuples, I can use split and then apply (also sketched below). The bad part here is that split converts the unique index to a factor, and that conversion is terribly time-consuming. Try system.time({zz <- as.factor(1:10^7)}).

  3. I'm now trying data.table, via a command like sumDT <- DT[,sum(val),by = c("ix1","ix2")]. However, I don't yet see how to merge sumDT back into DT, other than via something like DT2 <- merge(DT, sumDT, by = c("ix1","ix2")).
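
For concreteness, here is roughly what method 1 looks like (my sketch, using the Matrix package; sparseMatrix sums duplicate (i, j) entries as the matrix is built):

library(Matrix)

## Duplicate (ix1, ix2) cells are summed while the sparse matrix is built.
M  <- sparseMatrix(i = ix1, j = ix2, x = val)

## Convert back to a coordinate list: columns i, j, x hold the group sums.
cl <- summary(M)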
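
And method 2, sketched (the coercion of the id to a factor inside split is the slow step):

## One id per (ix1, ix2) pair; split() converts it to a factor (the slow part).
id     <- paste(ix1, ix2, sep = "\r")
sums   <- vapply(split(val, id), sum, numeric(1))
perRow <- sums[id]    ## broadcast each group's sum back onto its rows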

Is there a faster method for this data.table join than via the merge operation I've described?

[I've also tried bigsplit from the bigtabulate package, and some other methods. Anything that converts to a factor is pretty much out - as far as I can tell, that conversion process is very slow.]


Code to generate the data is below. Naturally, it's better to try a smaller N to check that something works, but not all methods scale well for N >> 1000.

library(data.table)

N   <-  10^7
set.seed(2011)
ix1 <-  1 + floor(rexp(N, 0.01))
ix2 <-  1 + floor(rexp(N, 0.01))
ix3 <-  1 + floor(rexp(N, 0.01))
val <-  runif(N)

DF  <-  data.frame(ix1 = ix1, ix2 = ix2, ix3 = ix3, val = val)
DF  <- DF[order(DF[,1],DF[,2],DF[,3]),]
DT  <- as.data.table(DF)

Solution

  • Well, it's possible you'll find that doing the merge isn't so bad as long as your keys are properly set.

    Let's set up the problem again:

    library(data.table)

    N   <-  10^6      ## not 10^7 because RAM is tight right now
    set.seed(2011)
    ix1 <-  1 + floor(rexp(N, 0.01))
    ix2 <-  1 + floor(rexp(N, 0.01))
    ix3 <-  1 + floor(rexp(N, 0.01))
    val <-  runif(N)
    DT <- data.table(ix1=ix1, ix2=ix2, ix3=ix3, val=val, key=c("ix1", "ix2"))
    

    Now you can calculate your summary stats:

    info <- DT[, list(summary=sum(val)), by=key(DT)]
    

    And merge the columns "the data.table way" (a keyed join of info back onto DT), or just with merge:

    m1 <- DT[info]            ## the data.table way
    m2 <- merge(DT, info)     ## if you're just used to merge
    identical(m1, m2)
    [1] TRUE
    

    If either of those ways of merging is too slow, you can try a trickier way that builds info at the cost of memory. Because DT is keyed, the grouped result comes back in the same row order as DT, so the new column can simply be bound on:

    ## rep() repeats each group's sum once for every row in that group
    info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)]
    m3 <- transform(DT, summary=info2$summary)
    identical(m1, m3)
    [1] TRUE
    

    Now let's see the timing:

    #######################################################################
    ## Using data.table[ ... ] or merge
    system.time(info <- DT[, list(summary=sum(val)), by=key(DT)])
       user  system elapsed 
      0.203   0.024   0.232
    
    system.time(DT[info])
       user  system elapsed 
      0.217   0.078   0.296
    
    system.time(merge(DT, info))
       user  system elapsed 
      0.981   0.202   1.185
    
    ########################################################################
    ## Now the two parts of the last version done separately:
    system.time(info2 <- DT[, list(summary=rep(sum(val), length(val))), by=key(DT)])
       user  system elapsed 
      0.574   0.040   0.616 
    
    system.time(transform(DT, summary=info2$summary))
       user  system elapsed 
      0.173   0.093   0.267
    

    Or you can skip building the intermediate info table entirely, if the nested call doesn't seem too inscrutable for your tastes:

    system.time(m5 <- DT[ DT[, list(summary=sum(val)), by=key(DT)] ])
       user  system elapsed 
      0.424   0.101   0.525 
    
    identical(m5, m1)
    [1] TRUE
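
    One closing note, beyond the timings above: if your data.table version supports grouped := assignment, you can skip the join entirely and write the group sums into DT by reference. A minimal sketch, assuming such a version:

    ## Assumes a data.table version with grouped `:=` (by-reference assignment).
    DT[, summary := sum(val), by = c("ix1", "ix2")]

    Because := modifies DT in place, no copy of the table is made at all.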