Tags: r, dplyr, data.table, rbind

How to improve the speed of rowbinds when the size of the final dataframe is unknown


I want to rowbind multiple dataframes, each with a varying number of rows. I understand that using a for loop in which something like a final_df is overwritten in each iteration is very slow, because R has to copy the growing dataframe on every iteration. Usually this problem can be solved by pre-allocating a dataframe with the correct number of rows and columns and then modifying it in place inside the loop. However, in my situation this is trickier, because each individual dataframe may have a different number of rows than the previous one. (In my actual code I am dealing with a long list of XML files from which I extract certain bits of information; depending on the file, I can end up with more or fewer rows.)
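
For reference, this is roughly the pre-allocation pattern I mean for the case where every iteration produces the same, known number of rows (a simplified sketch, not my actual code):

rows_per_iter <- 12
n_iter <- 50
ncols <- 10
# allocate the full dataframe up front, then fill it block by block
prealloc_df <- data.frame(matrix(NA_real_, nrow = rows_per_iter * n_iter, ncol = ncols))
for (i in 1:n_iter) {
  chunk <- matrix(rnorm(rows_per_iter * ncols), nrow = rows_per_iter, ncol = ncols)
  rows <- ((i - 1) * rows_per_iter + 1):(i * rows_per_iter)
  prealloc_df[rows, ] <- chunk  # fill the pre-allocated rows in place
}

This doesn't work here because I only know each dataframe's row count after processing its file.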

My attempt so far is to use dplyr::bind_rows() or data.table::rbindlist(), which seem to perform similarly well and both outclass do.call("rbind", ...) by a lot. However, I notice that even with these approaches, the computation time still grows nonlinearly as the number of dataframes increases.

Do you have suggestions on how I can further improve the speed of my code? Thanks a lot in advance!

create_df <- function() {
  nrow <- sample(12:15, 1)
  ncol <- 10
  randomdf <- matrix(rnorm(nrow*ncol), nrow=nrow, ncol=ncol) |> data.frame()
  return(randomdf)
}

approach1 <- function(n) {
  final_df <<- matrix(ncol=10, nrow=0)  # 10 columns hard-coded; `ncol` is not defined in this scope
  for(i in 1:n) {
    current_df <- create_df()
    final_df <<- rbind(final_df, current_df)
  }
}

approach2 <- function(n) {
  df_list <<- vector("list", n)
  for(i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- do.call("rbind", df_list)
}

approach3 <- function(n) {
  df_list <<- vector("list", n)
  for(i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- dplyr::bind_rows(df_list)
}

approach4 <- function(n) {
  df_list <<- vector("list", n)
  for(i in 1:n) {
    df_list[[i]] <<- create_df()
  }
  final_df <<- data.table::rbindlist(df_list)
}


microbenchmark::microbenchmark(
  approach1(5),
  approach2(5),
  approach3(5),
  approach4(5),
  approach1(50),
  approach2(50),
  approach3(50),
  approach4(50),
  approach1(500),
  approach2(500),
  approach3(500),
  approach4(500),
  times = 10
  )
Unit: microseconds
           expr      min       lq      mean    median       uq      max neval
   approach1(5)   1173.5   1201.1   1317.12   1285.30   1402.2   1557.0    10
   approach2(5)    771.6    781.8   1121.18    829.15    944.6   3573.1    10
   approach3(5)    543.7    613.4    966.10    672.15    952.4   3131.8    10
   approach4(5)    520.8    586.5    641.18    621.65    663.8    818.8    10
  approach1(50)  12186.9  12381.4  13932.40  12760.10  14518.8  18537.4    10
  approach2(50)   6497.6   6766.0   7160.26   6967.55   7230.3   8390.6    10
  approach3(50)   3681.3   4143.1   4258.44   4233.10   4347.8   5022.8    10
  approach4(50)   3806.7   3821.8   4166.71   3962.95   4190.6   5900.4    10
 approach1(500) 275530.0 285779.1 326732.16 294302.30 304461.0 622130.3    10
 approach2(500)  65243.8  67456.7  72789.76  74422.30  77063.0  79485.0    10
 approach3(500)  38600.0  39328.4  41372.67  41215.80  42345.2  47488.8    10
 approach4(500)  32496.5  36788.1  41160.35  39940.10  46043.2  49752.9    10

Solution

  • approach3 and approach4 are spending most of their time in create_df, so you're not getting a good idea of the speed of the binding operation. Better to time just the binding:

    library(dplyr)
    library(data.table)
    
    create_df <- function(n) {  # n is just the lapply index; it is ignored
      nrow <- sample(12:15, 1)
      ncol <- 10
      randomdf <- matrix(rnorm(nrow*ncol), nrow=nrow, ncol=ncol) |> data.frame()
      return(randomdf)
    }
    
    df_list <- lapply(c(5, 50, 500), \(n) lapply(1:n, create_df))
    
    approach2 <- function(i) do.call("rbind", df_list[[i]])
    approach3 <- function(i) bind_rows(df_list[[i]])
    approach4 <- function(i) rbindlist(df_list[[i]])
    approach5 <- function(i) rbindlist(df_list[[i]], use.names = FALSE)
    
    microbenchmark::microbenchmark(
      approach2(1),
      approach3(1),
      approach4(1),
      approach5(1),
      approach2(2),
      approach3(2),
      approach4(2),
      approach5(2),
      approach2(3),
      approach3(3),
      approach4(3),
      approach5(3)
    )
    #> Unit: microseconds
    #>          expr     min       lq      mean   median       uq     max neval
    #>  approach2(1)   321.1   360.40   389.968   377.25   406.65   601.5   100
    #>  approach3(1)    89.9   118.85   157.806   135.80   191.45   690.2   100
    #>  approach4(1)    77.2    89.05   176.894   103.05   161.15  4250.6   100
    #>  approach5(1)    61.8    70.10   100.532    94.15   120.60   223.7   100
    #>  approach2(2)  3070.4  3228.40  3735.250  3352.30  3574.90  8796.5   100
    #>  approach3(2)   348.3   408.35   470.308   440.50   514.70   931.6   100
    #>  approach4(2)   136.7   169.65   204.703   189.25   222.40   362.6   100
    #>  approach5(2)   111.5   133.85   194.793   150.10   199.50  2957.8   100
    #>  approach2(3) 31565.1 34130.30 36182.204 35523.60 36503.40 89033.4   100
    #>  approach3(3)  3008.7  3268.30  3785.467  3440.65  3714.85  7923.1   100
    #>  approach4(3)   794.4   913.45  1009.823   966.20  1054.20  1692.0   100
    #>  approach5(3)   655.8   767.35   870.240   822.45   894.95  2124.1   100
    

    Now it is clear that rbindlist() is the fastest option for larger lists of tables. If your overall process is taking a long time, though, the binding step probably isn't the first place I would look; profiling the full pipeline (see the sketch at the end of this answer) will show where the time actually goes.

    If you know the table columns all line up, you can squeeze a little more performance out of rbindlist() by setting its use.names argument to FALSE, as approach5 above does.
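
    To check where the time actually goes, you can profile the whole pipeline with base R's Rprof()/summaryRprof(). A minimal sketch, where process_one_file() is a hypothetical stand-in for your real per-XML-file extraction:

    library(data.table)

    # hypothetical stand-in for the real per-file work (e.g. parsing one XML file)
    process_one_file <- function(i) {
      nrow <- sample(12:15, 1)
      data.frame(matrix(rnorm(nrow * 10), nrow = nrow, ncol = 10))
    }

    Rprof(prof_file <- tempfile())        # start the sampling profiler
    df_list <- lapply(1:5000, process_one_file)
    final_df <- rbindlist(df_list)
    Rprof(NULL)                           # stop profiling
    summaryRprof(prof_file)$by.total      # time per function, sorted by total time

    If the extraction step dominates the profile rather than the bind, that is where optimization effort will pay off.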