Tags: r, vectorization, purrr, tibble

Vectorization with lists (as tibble columns) in R?


I am dealing with time series data for different individuals in a wide format. The number of time points differs between individuals, and I need the last element for each individual.

I was thinking of using a list as a column in my tibble to store the time series sequences. (Putting each time point into a different column is probably not a good idea, since there can be hundreds of possible time points while an individual might have data for only a handful of them; the data per individual is, however, always measured at consecutive time points.)
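
For illustration, the wide alternative I mean would look roughly like this (made-up values, hypothetical time-point columns t1 to t5, NA for time points without a measurement):

library(tibble)

# Hypothetical wide layout: one column per possible time point,
# padded with NA wherever an individual was not measured
df_wide <- tibble(
  id = 1:2,
  t1 = c(2.1, NA),
  t2 = c(2.4, NA),
  t3 = c(2.2, 3.0),
  t4 = c(NA, 3.1),
  t5 = c(NA, 3.5)
)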

Let's call the list column column1, i.e.:

library(tibble)
# Create an example dataframe
df <- tibble(
  column1 = list(1:3, 1:4, 4:8)
)

Now, I would like to use vectorization for the sake of efficiency and speed, but is it even possible with this data structure? There is the map() function in the purrr package, with which the operation would go like this:

library(purrr)

# Use the map function to select the last element of each vector
last_elements <- map(df$column1, ~ .x[length(.x)])

But this is not vectorization, but rather looping through the elements of the list (stored as column1), right?

Would there be a better (i.e. faster / more efficient) choice for a data structure than a list as a column? Or is this in general the best way to handle this kind of situation?


Solution

  • It is important to understand that vectorization in R means that you do not need to write a loop yourself, because somebody else has already written the loop for you. The important part is that eventually there will still be some sort of loop.

    The reason why relying on off-the-shelf vectorization is often faster than writing the loop yourself is that in the former case the loop runs at C level, which can be faster, but sometimes the difference is not even noticeable:

    library(microbenchmark)
    x <- seq_len(10000000L)
     
    microbenchmark(
      vec = sum(x),
      lp = {sm <- 0; for(i in seq_along(x)) sm <- sm + x[i]; sm}
    )
    # Unit: nanoseconds
    #  expr       min        lq      mean    median        uq       max neval cld
    #   vec       100       200      2745       350      4450     19300   100  a 
    #    lp 327515400 340921350 367242236 351582100 383396050 603997400   100   b
    

    This looks like a lot, but be aware of the unit: the vectorized sum finishes in roughly 0.0000001 seconds, while the R loop takes about 0.33 seconds. In relative terms this is a huge difference, but in absolute terms you will barely see it.

    Bottom line: do not be afraid of R-level loops, because they are far better than their reputation. If there is a natural replacement for a loop, go for it by all means, but do not hesitate to use a loop when the price for avoiding it is an over-bloated data structure.
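
    For illustration, an explicit R-level loop over the list column from the question would look like the following sketch (using df and column1 as defined above):

    # explicit R-level loop: preallocate, then fill in the last element per individual
    last_elements <- integer(length(df$column1))
    for (i in seq_along(df$column1)) {
      v <- df$column1[[i]]
      last_elements[i] <- v[length(v)]
    }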

    Now to your problem at hand.

    You could transform your data into long format and take the last row per individual. You can go even further and use data.table with an appropriate key. The benchmark below compares these long-format variants (dplyr, data.table, tapply, and a filter on a precomputed reverse-order column) with the list-column loops in purrr and base R:

    library(microbenchmark)
    library(tibble)
    library(tidyr)
    library(dplyr)
    library(purrr)
    library(data.table)
    
    create_data <- function(nr_indiv, max_per_indiv) {
      # one integer vector of random length (1 to max_per_indiv) per individual
      l <- replicate(nr_indiv, sample(10000L, sample(max_per_indiv, 1L)), simplify = FALSE)
      tibble(id = seq_len(nr_indiv), load = l)
    }
    
    set.seed(20022024)
    list_data <- create_data(10000, 1000)
    lng_data <- unnest(list_data, cols = load)
    dt_data <- copy(lng_data)
    setDT(dt_data)
    setkey(dt_data, "id")
    # reverse within-individual row index: the last measurement of each id gets ord == 1
    # (needed by the `flt` variant in the benchmark below)
    lng_data <- lng_data %>% mutate(ord = rev(row_number()), .by = id)
    
    my_check <- function(x) {
      isTRUE(all.equal(x$lst, x$lng %>% pull(load), check.attributes = FALSE)) &&
      isTRUE(all.equal(x$lst, x$dt[, load], check.attributes = FALSE)) &&
      isTRUE(all.equal(x$lst, c(x$ta), check.attributes = FALSE)) &&
      isTRUE(all.equal(x$lst, x$base, check.attributes = FALSE)) &&
      isTRUE(all.equal(x$lst, x$flt %>% pull(load), check.attributes = FALSE))
    }
    
    microbenchmark(
      lst = map_int(list_data$load, tail, n = 1L),
      lng = slice_tail(lng_data, n = 1L, by = id), 
      dt = dt_data[, .(load = tail(load, 1L)), "id"], 
      ta = tapply(lng_data$load, lng_data$id, tail, n = 1L),
      base = vapply(list_data$load, tail, integer(1L), n = 1L),
      flt = lng_data %>% filter(ord == 1L),
      check = my_check
    )
    
    # Unit: milliseconds
    #  expr      min        lq       mean     median         uq       max neval   cld
    #   lst  70.5403  77.40405   93.87981   87.87985  101.53635  280.5060   100 a    
    #   lng 916.2937 971.14420 1053.53934 1004.19125 1133.12665 1434.2933   100  b   
    #    dt  18.4768  19.47870   28.12581   20.64565   37.49055   64.8045   100   c  
    #    ta 315.0710 357.06000  407.04557  368.03110  453.28390  687.4518   100    d 
    #  base  65.5528  75.40010   85.31279   85.87180   94.14265  116.5428   100 a    
    #   flt  44.4467  47.46310   55.71809   50.85525   57.62820  244.5270   100     e
    
    

    The tidyverse long-format solution is worse than the purrr loop; the data.table long-format solution is better (not accounting for the overhead of creating the data structure); the filter method lies in between; and the base loop is on par with purrr, so even the slight overhead of purrr is negligible.

    There will be even more solutions, but from this first quick test it seems to me that the list solution is not that bad overall and, in my opinion equally (if not more) importantly, quite easy to read and understand.
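
    Applied back to the df from the question, the list-column approach stays a one-liner (same names as above; map_int returns an atomic integer vector rather than a list):

    # purrr: last element of each individual's series, returned as an integer vector
    last_elements <- map_int(df$column1, ~ .x[length(.x)])

    # base R equivalent without the purrr dependency
    last_elements_base <- vapply(df$column1, function(x) x[length(x)], integer(1L))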

    TL;DR: Don't be afraid of loops; if there is no straightforward, natural vectorization method, use them by all means.