I have a dataset with year-based data predicted by multiple models, in data.table format.
library(data.table)
nYears = 20 # real data: 110
nMod = 3 # real data: ~ 100
nGrp = 45
dataset <- data.table(
group_code = rep(seq(1:nGrp ), times= 3*nYears ),
Year = rep(seq(1:nYears ), each=nGrp ),
value = rnorm(2700 , mean = 10, sd = 2),
var1 = rep (rnorm(nGrp , mean = nMod, sd = 1) , times= nMod*nYears ),
var2 = rep (rnorm(nGrp , mean = 1.5, sd = 0.5) , times= nMod*nYears ),
model = as.character(rep(seq( from = 1, to = nMod ) , each=nGrp *nYears ))
)
setkey(dataset, Year, model)
I need to perform a set of calculations from this dataset based on a vector, named x, of lenght=1001 and consists on a seq(-2, 8, by=0.01)
.
To do so, I created a new data.table (dt) with repeated versions of dataset to merge vector x, accordingly:
dt <- dataset[, lapply(.SD, function(x) rep(x, 1001))]
dt[, x := rep(round(seq(-2, 8, by=0.01), 2), each= nYears*nGrp*nMod) ]
Since my original dataset includes hundreds of models, this operation is not memory efficient.
The most important operation I need, includes the generation of normal distribution of x, with mean = var1 and sd= var2, by group_code, Year and model. For example:
# key computation
dt [, norm_dist := dnorm (x, var1, var2) , by= .(group_code, Year, model )]
This last operation is quite fast in my desktop. However, I have other operations to perform that require to subset the data.table and are highly RAM consuming. An example:
dt[ x %between% c( 2, 5.99), dt2 := rep_len( rev(dt [x %between% c(-2, 1.99)]$value), length.out=.N) , by= .(Year, model) ]
The following error pop-s up:
Error: cannot allocate vector of size 1.3 Gb
I believe the problem in this specific step is related to the subset and the rev() function.
Nevertheless, the approach I'm using to performing the set of calculations based on the vector "x" from data.table dt, does not seem appropriate since the moment I merged the dataset with the vector I need for calculations ("x").
I was hoping someone could teach me how to efficiently improve my code, since I have a considerable amount of models in the original dataset, greatly increasing its size.
Thank you!
I think that this part of code should be clearer
dt[ x %between% c( 2, 5.99), dt2 := rep_len( rev(dt [x %between% c(-2, 1.99)]$value), length.out=.N) , by= .(Year, model) ]
as it is a bit like a black box to me. Especially because this double subsetting is where your problem is generated.
These bits of code, x %between% c( 2, 5.99)
and dt[x %between% c(-2, 1.99)]
, should result always to the same positions in all your cases. You should consider that in your code to make it more efficient.
Try something like this to make things a bit clearer:
by_YM <- split(dt, by=c("Year", "model"))
ind1 <- which(by_YM[[1]][["x"]] %between% c( 2, 5.99))
ind2 <- which(by_YM[[1]][["x"]] %between% c(-2, 1.99))
for(i in 1:length(by_YM)){
dt_i <- by_YM[[i]]
#val1 <- rep_len(rev(dt_i$value[ind2]), length.out=length(ind1)) #val1 is equal to val, no need for rep_len
val <- rev(dt_i$value[ind2])
by_YM[[i]] <- dt_i[ind1, dt2 := val]
}
however ours dt2 columns are not equal but as I am not sure of how the final result should be I cannot debug it.
dt2_a <- dt[Year == 20 & model == 3, dt2]
dt2_b <- by_YM[["20.3"]][, dt2]
test <- cbind(dt2_a, dt2_b)
The second code is also much faster.
library(microbenchmark)
microbenchmark( "new_code" = {
by_YM <- split(dt, by=c("Year", "model"))
ind1 <- which(by_YM[[1]][["x"]] %between% c( 2, 5.99))
ind2 <- which(by_YM[[1]][["x"]] %between% c(-2, 1.99))
for(i in 1:length(by_YM)){
dt_i <- by_YM[[i]]
val1 <- rep_len(rev(dt_i$value[ind2]), length.out=length(ind1)) #val1 is equal to val, no need for rep_len
val <- rev(dt_i$value[ind2])
by_YM[[i]] <- dt_i[ind1, dt2 := val]
}}, "old_code" = dt[ x %between% c( 2, 5.99),
dt2 := rep_len( rev(dt [x %between% c(-2, 1.99)]$value), length.out=.N) , by= .(Year, model) ],
times = 5)
Unit: milliseconds
expr min lq mean median uq max neval cld
new_code 155.426 156.4916 200.6587 185.0347 188.9436 317.3977 5 a
old_code 1290.909 1299.8570 1398.6866 1370.4526 1471.0569 1561.1574 5 b
Give it a try and good luck