Subset a list by dynamic lengths efficiently


My data consists of a large list of integer vectors of varying lengths, and I want to truncate each element to a pre-specified length.

An example of my data:

my_list <- list(c(-4L, -2L),
                c(4L, 6L, 9L, -4L, 10L, 2L, -3L, 8L),
                c(-1L, 1L),
                c(-4L, -5L, 5L, -2L, 4L, 10L, 7L),
                c(-2L, 10L, 3L, -3L, 8L, -1L, 7L, 4L, 0L, 2L))

I know the final lengths beforehand and want to essentially pick the first n numbers of each list element based on those calculated lengths.

Let's say those final lengths are:

sizes <- c(1, 7, 0, 5, 8)

This would mean the output should look like:

[[1]]
[1] -4

[[2]]
[1]  4  6  9 -4 10  2 -3

[[3]]
integer(0)

[[4]]
[1] -4 -5  5 -2  4

[[5]]
[1] -2 10  3 -3  8 -1  7  4

As my real data consists of > 500k groups, loops are generally too slow and therefore I would prefer a faster solution.

Any help would be much appreciated.


Solution

  • The simplest code I can think of is to Map over the data and the sizes together, subsetting each element with head:

    my_list2 <- rep(my_list, 1e5)
    sizes2 <- rep(sizes, 1e5)
    
    system.time({Map(head, my_list2, sizes2)})
    ##   user  system elapsed 
    ##   2.81    0.19    3.00
    

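    Replacing head with direct subsetting inside the same Map call is roughly 4x faster. The zero case needs its own branch because 1:0 counts down to c(1, 0) instead of being empty: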

    system.time(Map(\(l,s) if(s == 0) l[0] else l[1:s], my_list2, sizes2))
    ##   user  system elapsed 
    ##   0.69    0.00    0.69 
    

    Shortening each element in place with length<- inside a for loop is quicker again:

    system.time({
        for(i in seq_along(my_list2)) {
            length(my_list2[[i]]) <- sizes2[i]
        }
    })
    ##   user  system elapsed 
    ##   0.16    0.02    0.18
    

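    The loop returns the same result as the Map option too. Because the loop has already overwritten my_list2, the comparison rebuilds the untouched input first:

    identical(my_list2, Map(head, rep(my_list, 1e5), sizes2))
    ##[1] TRUE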
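
As a sketch on the small example from the question, two variations on the approaches above may be worth considering. Direct subsetting can use seq_len(), which returns integer(0) for a size of zero and so needs no special case. The length<- loop can be wrapped in a function that caps each size first, since length<- pads with NA when the requested length exceeds the current one. The helper name truncate_list is made up for illustration, and no timings are claimed for either variant.

```r
# Example data from the question
my_list <- list(
  c(-4L, -2L),
  c(4L, 6L, 9L, -4L, 10L, 2L, -3L, 8L),
  c(-1L, 1L),
  c(-4L, -5L, 5L, -2L, 4L, 10L, 7L),
  c(-2L, 10L, 3L, -3L, 8L, -1L, 7L, 4L, 0L, 2L)
)
sizes <- c(1, 7, 0, 5, 8)

# Reference result from the Map/head approach
expected <- Map(head, my_list, sizes)

# seq_len(0) is integer(0), so a size of zero needs no special case
by_seq_len <- Map(function(l, s) l[seq_len(s)], my_list, sizes)

# Hypothetical helper: shorten every element of x to the lengths in n.
# Sizes are capped at the current lengths so length<- never pads with NA.
truncate_list <- function(x, n) {
  n <- pmin(n, lengths(x))
  for (i in seq_along(x)) {
    length(x[[i]]) <- n[i]
  }
  x
}
by_length <- truncate_list(my_list, sizes)

# Both variants should agree with the reference result
stopifnot(identical(by_seq_len, expected),
          identical(by_length, expected))
```

Because the loop runs inside a function, it works on a copy of its argument, so my_list itself is left unmodified and can still be used for later comparisons.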