Tags: r, list, dataframe, for-loop, vectorization

R 4.0 performance: dataframes vs lists, loops vs vectorized - example with constant vector subtraction


This is the third time I am reading Hadley Wickham's well-known Advanced R, and in the second chapter he explains why iterating over lists is better than iterating over data frames. Everything seems reasonable, familiar, and expected. The example function subtracts a vector of medians from each column of a data frame; it is basically a form of centering. To test things out I ran the code provided by Grosser, Bumann & Wickham in Advanced R Solutions.

library(bench)

# Function to generate random dataframe
create_random_df <- function(nrow, ncol) {
  random_matrix <- matrix(runif(nrow * ncol), nrow = nrow)
  as.data.frame(random_matrix)
}

# For loop function that runs on dataframe
subtract_df <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

# Same for loop function but that runs on list
subtract_list <- function(x, medians) {
  x <- as.list(x)
  x <- subtract_df(x, medians)
  list2DF(x)
}


benchmark_medians <- function(ncol) {
  df <- create_random_df(nrow = 1e4, ncol = ncol)
  medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)
  
  bench::mark(
    "data frame" = subtract_df(df, medians),
    "list" = subtract_list(df, medians),
    time_unit = "ms"
  )
}

# Results
results <- bench::press(
  ncol = c(1, 10, 50, 100, 250, 300, 400, 500, 750, 1000),
  benchmark_medians(ncol)
)
#> Running with:
#>     ncol
#>  1     1
#>  2    10
#>  3    50
#>  4   100
#>  5   250
#>  6   300
#>  7   400
#>  8   500
#>  9   750
#> 10  1000

library(ggplot2)

ggplot(
  results,
  aes(ncol, median, col = attr(expression, "description"))
) +
  geom_point(size = 2) +
  geom_smooth() +
  labs(
    x = "Number of Columns",
    y = "Execution Time (ms)",
    colour = "Data Structure"
  ) +
  theme(legend.position = "top")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2021-08-08 by the reprex package (v2.0.1)

Again, nothing out of the ordinary, the authors explain that:

When working directly with the data frame, the execution time grows quadratically with the number of columns in the input data. This is because (e.g.) the first column must be copied n times, the second column n-1 times, and so on. When working with a list, the execution time increases only linearly.

Obviously in the long run, linear growth creates shorter run-times, but there is some cost to this strategy — we have to convert between data structures with as.list() and list2DF(). Even though this is fast and probably doesn’t hurt much, the improved approach doesn’t really pay off in this scenario until we get to a data frame that is about 300 columns wide (with the exact value depending on the characteristics of the system running the code).
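
The copying the authors describe can be made visible with tracemem(), which prints a message each time R copies the traced object. Here is a rough check using the functions defined above (the exact number of copies reported depends on the R version):

df_small <- create_random_df(nrow = 1e4, ncol = 5)
meds     <- vapply(df_small, median, numeric(1), USE.NAMES = FALSE)

tracemem(df_small)                      # report every copy of the data frame
invisible(subtract_df(df_small, meds))  # several copy messages: the frame is copied repeatedly

lst <- as.list(df_small)
tracemem(lst)
invisible(subtract_df(lst, meds))       # typically a single copy message for the whole loop
untracemem(df_small)
untracemem(lst)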

But then I thought to myself that I work a lot with data frames, not lists, and surely I would be able to write a function that runs in decent time on a data frame with many columns. How wrong I was! I now have more questions than when I started:

  • Aren't for loops supposed to be slower than vectorized operations?
  • Did version 4 of R get new optimizations I am not aware of?
  • Why is the sweep() function so slow?
  • Shouldn't matrix/vectorized operations be at least as fast as the loops?
  • Does the garbage collector slow things down significantly?
  • What am I doing wrong inside the functions (I surely am in some of them)?

library(bench)
  
# Function to generate random dataframe
create_random_df <- function(nrow, ncol) {
  random_matrix <- matrix(runif(nrow * ncol), nrow = nrow)
  as.data.frame(random_matrix)
}

# For loop function that runs on dataframe
subtract_df <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

# Same for loop function but that runs on list
subtract_list <- function(x, medians) {
  x <- as.list(x)
  x <- subtract_df(x, medians)
  list2DF(x)
}
  
# 5 Functions that I thought should be decently fast 
my_mapply <- function(x, medians) {
  as.data.frame(
    mapply(
      function(x, medians){x - medians},
      x,
      medians
    )  
  )
} 

my_sweep <- function(x, medians) {
  sweep(x, 2, medians)
}

my_apply <- function(x, medians) {
  x <- t(apply(x, 1, function(a) a - medians))
  as.data.frame(x)
}

my_vectorized <- function(x, medians) { 
  x <- as.matrix(x)
  x <- t(t(x) - medians)
  x <- as.data.frame(x)
  x
}  

my_vectorized2 <- function(x, medians) {
  ones <- rep(1, nrow(x))
  x_median <- ones %*% t(medians)
  x - x_median
}


benchmark_medians <- function(ncol) {
  df <- create_random_df(nrow = 1e4, ncol = ncol)
  medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)
  
  bench::mark(
    "data frame" = subtract_df(df, medians),
    "list" = subtract_list(df, medians),
    "my_mapply" = my_mapply(df, medians), 
    "my_sweep" = my_sweep(df, medians),
    # "my_apply" = my_apply(df, medians),   # excluded because it is very very slow compared with the others
    "my_vectorized" = my_vectorized(df, medians),
    "my_vectorized2" = my_vectorized2(df, medians),
    time_unit = "ms"
  )
}

# Have a quick check on dataframe with only 3 columns
benchmark_medians(3)
#> # A tibble: 6 x 6
#>   expression        min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 data frame     0.0463 0.0884    12769.    4.44MB     55.9
#> 2 list           0.0640 0.0693    12944.  486.86KB    127. 
#> 3 my_mapply      0.152  0.160      5772.    1.05MB    124. 
#> 4 my_sweep       1.14   1.18        809.    2.21MB     38.7
#> 5 my_vectorized  0.208  0.212      4253.     1.1MB     89.2
#> 6 my_vectorized2 0.844  0.875      1074.    1.93MB     46.4

# Results
results <- bench::press(
  ncol = c(1, 10, 50, 100, 250, 300, 400, 500, 750, 1000),
  benchmark_medians(ncol)
)
#> Running with:
#>     ncol
#>  1     1
#>  2    10
#>  3    50
#>  4   100
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  5   250
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  6   300
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  7   400
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  8   500
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#>  9   750
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> 10  1000
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

library(ggplot2)

ggplot(
  results,
  aes(ncol, median, col = attr(expression, "description"))
) +
  geom_point(size = 2) +
  geom_smooth() +
  labs(
    x = "Number of Columns",
    y = "Execution Time (ms)",
    colour = "Data Structure"
  ) +
  theme(legend.position = "top")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2021-08-08 by the reprex package (v2.0.1)

These results are pretty much consistent throughout all of my testing, regardless of the characteristics of the generated data. Any insights, as well as advice on performance or on which functions (or families of functions) to use, are appreciated.

Thank you.


Solution

  • Aren't for loops supposed to be slower than vectorized operations? / Shouldn't matrix/vectorized operations be at least as fast as the loops?

    It depends. However, I think there is a misunderstanding about vectorized operations (and data.frames). Consider my_vectorized2():

    my_vectorized2 <- function(x, medians) {
      ones <- rep(1, nrow(x))
      x_median <- ones %*% t(medians)
      x - x_median
    }
    

    The "bottleneck" here is the 4th line. Contrary to what you assume (I presume), x - medians dot not trigger vectorized computation: x is not a matrix but a data.frame (i.e., a list). Arithmetic with data.frames works because convenient methods have been implemented. That this should not be taken for granted can be seen in the example below.

    # `+` method for data.frames
    as.data.frame(1:5) + as.data.frame(6:10)
    
    1   7
    2   9
    3  11
    4  13
    5  15
    
    # no `+` method for lists!
    as.list(1:5) + as.list(6:10)
    
    Error in as.list(1:5) + as.list(6:10) : 
      non-numeric argument to binary operator
    

    However, these convenient data.frame methods are, in short, less efficient than calculations on plain vectors or matrices. By profiling the code, we can take a closer look at where the time is spent:

    df <- create_random_df(nrow = 1e4, ncol = 1000)
    medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)
    profvis::profvis(my_vectorized2(df, medians))
    

    [profvis profiling output for my_vectorized2()]

    By far the most time is spent in Ops.data.frame(), which appears to split the frame column by column before applying the "-" method.
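
    The dispatch itself is easy to confirm from the console: the arithmetic group generic for data.frames is an ordinary, and fairly involved, R function in base (a quick check, not part of the profiling output above):

    class(df)
    #> [1] "data.frame"
    base::Ops.data.frame   # the method that ends up handling `x - x_median`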

  • Why is the sweep() function so slow? / Does the garbage collector slow things down significantly?

    I am not too familiar with the inner workings of sweep(), but its poor performance is essentially due to the issue discussed above. In addition, multiple references to the data are created, which results in copies; garbage collection also seems to play a fair part (profiling confirms this). The same applies to your approaches using mapply() and apply(). my_vectorized() is comparatively fast because the actual computation is vectorized, but the conversions to and from data.frame and the transpositions with t() are costly (each creates a copy) and should be avoided.
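
    To illustrate that last point, here is a sketch of a variant (my own naming, not from the question) that keeps the single as.matrix() conversion but drops both calls to t() by letting R recycle the medians column-wise:

    my_vectorized3 <- function(x, medians) {
      m <- as.matrix(x)
      # rep(medians, each = nrow(m)) lays the medians out column by column,
      # so column j of m gets medians[j] subtracted from every row
      m <- m - rep(medians, each = nrow(m))
      as.data.frame(m)
    }

    The heavy lifting is then a single vectorized subtraction on a numeric matrix, and only one conversion in each direction remains.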

    In summary, my general advice would be to use matrices instead of data.frames when you strive for performance in arithmetic operations.
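
    As a concrete illustration of that advice (a minimal sketch, assuming the data can live in a matrix from the start and data.frame conveniences such as mixed column types are not needed), the centering then requires no conversions and no method dispatch at all:

    m <- matrix(runif(1e4 * 100), nrow = 1e4)
    col_medians <- apply(m, 2, median)
    centered <- m - rep(col_medians, each = nrow(m))  # plain vectorized arithmetic on a matrix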