Tags: r, performance, merge, data.table, memory-efficient

Memory-efficient ways of merging a vector with a large data.table to perform calculations (R)


I have a dataset with year-based data predicted by multiple models, in data.table format.

library(data.table)
nYears = 20 # real data: 110
nMod   = 3  # real data: ~ 100
nGrp   = 45

dataset <- data.table(
  group_code = rep(seq_len(nGrp), times = nMod * nYears),
  Year       = rep(rep(seq_len(nYears), each = nGrp), times = nMod),
  value      = rnorm(nGrp * nYears * nMod, mean = 10, sd = 2),
  var1       = rep(rnorm(nGrp, mean = nMod, sd = 1), times = nMod * nYears),
  var2       = rep(rnorm(nGrp, mean = 1.5, sd = 0.5), times = nMod * nYears),
  model      = as.character(rep(seq_len(nMod), each = nGrp * nYears))
)
setkey(dataset, Year, model)

I need to perform a set of calculations on this dataset based on a vector, named x, of length 1001, consisting of seq(-2, 8, by = 0.01). To do so, I created a new data.table (dt) with repeated versions of dataset and merged in the vector x, as follows:

dt <- dataset[, lapply(.SD, function(col) rep(col, 1001))]
dt[, x := rep(round(seq(-2, 8, by = 0.01), 2), each = nYears*nGrp*nMod)]

Since my original dataset includes hundreds of models, this operation is not memory efficient.
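
A rough back-of-the-envelope estimate illustrates the scale (my own assumptions: the real dimensions mentioned above and about eight mostly numeric columns):

nYears_r <- 110; nMod_r <- 100; nGrp_r <- 45; nX <- 1001   # assumed real dimensions
n_rows <- as.numeric(nGrp_r) * nYears_r * nMod_r * nX      # ~4.95e8 rows
n_cols <- 8                                                # incl. x and norm_dist
n_rows * n_cols * 8 / 2^30                                 # roughly 30 GiB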

The most important operation I need involves generating the normal density of x, with mean = var1 and sd = var2, by group_code, Year and model. For example:

# key computation
dt[, norm_dist := dnorm(x, var1, var2), by = .(group_code, Year, model)]
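
As an aside, since dnorm() is vectorised over all three of its arguments and var1/var2 are ordinary columns, the same column can, if I understand correctly, be computed without the grouping overhead:

dt[, norm_dist := dnorm(x, var1, var2)]  # same values, no by= needed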
   

This last operation is quite fast on my desktop. However, I have other operations to perform that require subsetting the data.table and are highly RAM-consuming. An example:

dt[x %between% c(2, 5.99),
   dt2 := rep_len(rev(dt[x %between% c(-2, 1.99)]$value), length.out = .N),
   by = .(Year, model)]

The following error pops up:

Error: cannot allocate vector of size 1.3 Gb

I believe the problem in this specific step is related to the subsetting and the rev() function: the inner dt[x %between% c(-2, 1.99)] is re-evaluated for every (Year, model) group, and each evaluation copies a large subset of dt.

Nevertheless, the overall approach, merging the vector "x" into the dataset and performing the calculations on the expanded data.table dt, does not seem appropriate in the first place.

I was hoping someone could show me how to make this code more efficient, since the original dataset contains a considerable number of models, which greatly increases its size.

Thank you!


Solution

  • I think that this part of the code needs to be clearer

    dt[x %between% c(2, 5.99),
       dt2 := rep_len(rev(dt[x %between% c(-2, 1.99)]$value), length.out = .N),
       by = .(Year, model)]
    

    as it is a bit of a black box to me, especially because this double subsetting is where your problem is generated.

    These two subsets, x %between% c(2, 5.99) and dt[x %between% c(-2, 1.99)], should always resolve to the same row positions in all your cases. You should exploit that to make your code more efficient.

    Try something like this to make things a bit clearer:

    by_YM <- split(dt, by = c("Year", "model"))
    ind1  <- which(by_YM[[1]][["x"]] %between% c(2, 5.99))
    ind2  <- which(by_YM[[1]][["x"]] %between% c(-2, 1.99))

    for (i in seq_along(by_YM)) {
      dt_i <- by_YM[[i]]
      # rep_len() is not needed here: ind1 and ind2 have the same length,
      # so rev(dt_i$value[ind2]) already fits exactly
      val <- rev(dt_i$value[ind2])
      by_YM[[i]] <- dt_i[ind1, dt2 := val]
    }
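
    If you need a single table back afterwards, the per-group pieces can be reassembled with rbindlist():

    dt_new <- rbindlist(by_YM)  # stack the list of data.tables back together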
    

    However, our dt2 columns are not equal and, as I am not sure what the final result should be, I cannot debug it:

    dt2_a <- dt[Year == 20 & model == "3", dt2]
    dt2_b <- by_YM[["20.3"]][, dt2]
    
    test  <- cbind(dt2_a, dt2_b)
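
    One quick way to quantify the mismatch (ignoring the NA rows outside the assigned range) might be:

    sum(dt2_a != dt2_b, na.rm = TRUE)   # number of positions that disagree
    head(which(dt2_a != dt2_b))         # indices of the first differences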
    

    The second version is also much faster:

    library(microbenchmark)

    microbenchmark(
      "new_code" = {
        by_YM <- split(dt, by = c("Year", "model"))
        ind1  <- which(by_YM[[1]][["x"]] %between% c(2, 5.99))
        ind2  <- which(by_YM[[1]][["x"]] %between% c(-2, 1.99))
        for (i in seq_along(by_YM)) {
          dt_i <- by_YM[[i]]
          val  <- rev(dt_i$value[ind2])
          by_YM[[i]] <- dt_i[ind1, dt2 := val]
        }
      },
      "old_code" = dt[x %between% c(2, 5.99),
                      dt2 := rep_len(rev(dt[x %between% c(-2, 1.99)]$value), length.out = .N),
                      by = .(Year, model)],
      times = 5
    )
    
    Unit: milliseconds
         expr      min        lq      mean    median        uq       max neval cld
     new_code  155.426  156.4916  200.6587  185.0347  188.9436  317.3977     5  a 
     old_code 1290.909 1299.8570 1398.6866 1370.4526 1471.0569 1561.1574     5   b
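
    If you would rather stay inside a single data.table call, here is a sketch of the same idea: compute the two index vectors once and reuse them in every group. It assumes, as in your construction of dt, that the internal x layout is identical within every (Year, model) group.

    ind1 <- dt[Year == 1 & model == "1", which(x %between% c(2, 5.99))]
    ind2 <- dt[Year == 1 & model == "1", which(x %between% c(-2, 1.99))]

    dt[, dt2 := {
      v <- rep(NA_real_, .N)       # NA outside the target range
      v[ind1] <- rev(value[ind2])  # equal-length vectors, no rep_len() needed
      v
    }, by = .(Year, model)]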
    

    Give it a try, and good luck!