Tags: r, performance, data.table

Most efficient way (fastest) to modify a data.frame using indexing


A little introduction to the question:

I am developing an ecophysiological model, and I use a Reference Class list called S that stores every object the model needs for input/output (e.g. meteo data, physiological parameters, etc.). This list contains 5 objects (see example below):
- two data.frames, S$Table_Day (the outputs from the model) and S$Met_c (the input meteo data), which both have variables in columns and observations (input or output) in rows;
- a list of parameters, S$Parameters;
- a matrix;
- a vector.

The model runs many functions with a daily time step. Each day is computed in a for loop running from the first day i=1 to the last day i=n. The list is passed to the functions, which often take data from S$Met_c and/or S$Parameters as input and compute something that is stored in S$Table_Day, using indexes (the ith day). S is a Reference Class object because Reference Classes avoid copy-on-modify, which is very important considering the number of computations.
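The daily loop described above can be sketched like this (a minimal, illustrative version: the class fields and the compute_phenology step are stand-ins, not the actual model code):

```r
library(methods)  # setRefClass

# Illustrative class with only the two fields used here
SimulationClass <- setRefClass("Simulation",
                               fields = list(Table_Day = "data.frame",
                                             Met_c = "data.frame"))
S <- SimulationClass$new()
n <- 10  # number of simulated days (illustrative)
S$Met_c <- data.frame(DegreeDays = rnorm(n, mean = 10, sd = 1))
S$Table_Day <- data.frame(Bud_dd = numeric(n))

# Hypothetical daily step: accumulate degree-days up to day i
compute_phenology <- function(S, i) {
  S$Table_Day$Bud_dd[i] <- sum(S$Met_c$DegreeDays[1:i])
}

for (i in seq_len(n)) {
  compute_phenology(S, i)  # S is modified in place (Reference Class)
}
```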

The question itself:

As the model is very slow, I am trying to decrease computation time by micro-benchmarking different solutions.
Today I found something surprising when comparing two ways of storing my data: storing data by index into one of the preallocated data.frames is slower than storing it into an undeclared vector. After reading this, I thought preallocating memory was always faster, but it seems that R performs extra operations when modifying by index (probably checking lengths, types, etc.).
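One way to see this extra work is tracemem(), which prints a message every time R copies the traced object; assigning into a data.frame column by index typically triggers copies that a plain top-level vector assignment avoids (a sketch; the exact number of messages depends on the R version):

```r
df <- data.frame(x = numeric(1e4))
v  <- numeric(1e4)

tracemem(df)   # report every copy of df
df$x[1] <- 1   # usually prints one or more "tracemem" copy messages

untracemem(df)
tracemem(v)
v[1] <- 1      # a top-level vector is typically modified in place: no message
```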

My question is: is there a better way to perform such operations? In other words, is there a way for me to use/store the inputs/outputs more efficiently (in a data.frame, a list of vectors, or something else) to keep track of the computations of each day? For example, would it be better to use many vectors (one per variable) and regroup them into more complex objects (e.g. a list of data.frames) at the end?

By the way, am I right to use Reference Classes to avoid copying the big objects in S while passing it to functions and modifying it from within them?
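For what it's worth, the reference semantics can be checked directly: a field modified inside a function persists on a Reference Class object but not on a plain list (a minimal sketch with illustrative names):

```r
library(methods)  # setRefClass

DemoClass <- setRefClass("Demo", fields = list(x = "numeric"))
d <- DemoClass$new(x = 0)
l <- list(x = 0)

bump <- function(obj) obj$x <- obj$x + 1  # same code for both

bump(d)  # modifies d in place (reference semantics)
bump(l)  # modifies a local copy; l itself is untouched

d$x  # 1: the change persisted
l$x  # 0: only the copy was modified
```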

Reproducible example for the comparison:

# Packages needed to run this example:
library(methods)         # setRefClass
library(microbenchmark)
library(ggplot2)         # autoplot

SimulationClass <- setRefClass("Simulation",
                               fields = list(Table_Day = "data.frame",
                                             Met_c = "data.frame",
                                             PerCohortFruitDemand_c = "matrix",
                                             Parameters = "list",
                                             Zero_then_One = "vector"))
S= SimulationClass$new()
# Initializing the table with dummy numbers : 
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))


f1= function(i){
    a= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}

f2= function(i){
    S$Table_Day$Bud_dd[(i-1000):i]= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}

res= microbenchmark(f1(1000),f2(1000),times = 10000)
autoplot(res)

And the result: (figure: autoplot of the microbenchmark results, f1 vs f2)

Also, if someone has any experience in programming such models, I am deeply interested in any advice on model development.


Solution

  • I read more about the question, and I'll just write here for posterity some of the solutions that were proposed in other posts.

    Apparently, reading and writing are both worth considering when trying to reduce the time of assignment to a data.frame by index. The sources can all be found in other discussions:

    Several solutions appeared relevant:

    1. Use a matrix instead of a data.frame if possible, to leverage in-place modification (Advanced R).
    2. Use a list instead of a data.frame, because [<-.data.frame is not a primitive function (Advanced R).
    3. Write functions in C++ and use Rcpp (from this source).
    4. Use .subset2 to read instead of [ (third source).
    5. Use data.table as recommended by @JulienNavarre and @Emmanuel-Lin and the different sources, and use either set for a data.frame or := if using a data.table is not a problem.
    6. Use [[ instead of [ when possible (indexing by a single value only). This one is not very effective and very restrictive, so I removed it from the following comparison.

    Here is the analysis of performance using the different solutions :

    The code :

    # Loading packages :
    library(data.table)
    library(microbenchmark)
    library(ggplot2)
    
    # Creating dummy data :
    SimulationClass <- setRefClass("Simulation",
                                   fields = list(Table_Day = "data.frame",
                                                 Met_c= "data.frame",
                                                 PerCohortFruitDemand_c="matrix",
                                                 Parameters= "list",
                                                 Zero_then_One="vector"))
    S= SimulationClass$new()
    S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
    S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
    
    # Transforming data objects into simpler forms :
    mat= as.matrix(S$Table_Day)
    Slist= as.list(S$Table_Day)
    Metlist= as.list(S$Met_c)
    MetDT= as.data.table(S$Met_c)
    SDT= as.data.table(S$Table_Day)
    
    # Setting up the functions for the tests :
    f1= function(i){
        S$Table_Day$Bud_dd[i]= cumsum(S$Met_c$DegreeDays[i])
    }
    f2= function(i){
        mat[i,4]= cumsum(S$Met_c$DegreeDays[i])
    } 
    f3= function(i){
        mat[i,4]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
    } 
    f4= function(i){
        Slist$Bud_dd[i]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
    }
    f5= function(i){
        Slist$Bud_dd[i]= cumsum(Metlist$DegreeDays[i])
    }
    f6= function(i){
        set(S$Table_Day, i=as.integer(i), j="Bud_dd", cumsum(S$Met_c$DegreeDays[i]))
    }
    f7= function(i){
        set(S$Table_Day, i=as.integer(i), j="Bud_dd", MetDT[i,cumsum(DegreeDays)])
    }
    f8= function(i){
        SDT[i,Bud_dd := MetDT[i,cumsum(DegreeDays)]]
    }
    
    
    i= 6000:6500
    res= microbenchmark(f1(i),f2(i),f3(i),f4(i),f5(i),f6(i),f7(i),f8(i), times = 10000)
    autoplot(res)
    

    And the resulting autoplot: (figure: autoplot comparing f1 to f8)

    With f1 the reference base assignment, f2 using a matrix instead of a data.frame, f3 combining .subset2 and a matrix, f4 using a list and .subset2, f5 using two lists (for both reading and writing), f6 using data.table::set, f7 using data.table::set with a data.table for the cumulative sum, and f8 using data.table's :=.

    As we can see, the best solution is to use lists for both reading and writing. It is pretty surprising that data.table comes out as the worst solution here. I believe I did something wrong with it, because it is supposed to be the best. If you can improve it, please tell me.
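    Following up on the list-based winner, one pattern worth trying (a sketch, not benchmarked against the real model; the * 2 step is a stand-in for an actual computation) is to carry plain lists of vectors through the loop and rebuild a data.frame only once, at the end, as the question suggested:

```r
n <- 10000
Metlist <- list(DegreeDays = rnorm(n, mean = 10, sd = 1))
Out <- list(Bud_dd = numeric(n))  # preallocated output vector in a list

for (i in seq_len(n)) {
  # plain-vector read/write: no [<-.data.frame dispatch on each assignment
  Out$Bud_dd[i] <- Metlist$DegreeDays[i] * 2  # stand-in for a model step
}

# Regroup into a data.frame only once, after the loop
Table_Day <- as.data.frame(Out)
```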