My data consists of a large list of integer vectors of various lengths, and I want to truncate each element to a pre-specified length.
An example of my data:
my_list <- list(c(-4L, -2L), c(4L, 6L, 9L, -4L, 10L, 2L, -3L, 8L), c(-1L,
1L), c(-4L, -5L, 5L, -2L, 4L, 10L, 7L), c(-2L, 10L, 3L, -3L,
8L, -1L, 7L, 4L, 0L, 2L))
I know the final lengths beforehand and want to pick the first n values of each list element, where n is the corresponding pre-calculated length.
Let's say those final lengths are:
sizes <- c(1, 7, 0, 5, 8)
This would mean the output should look like:
[[1]]
[1] -4
[[2]]
[1] 4 6 9 -4 10 2 -3
[[3]]
integer(0)
[[4]]
[1] -4 -5 5 -2 4
[[5]]
[1] -2 10 3 -3 8 -1 7 4
As my real data consists of > 500k groups, loops are generally too slow and therefore I would prefer a faster solution.
Any help would be much appreciated.
The simplest code I can think of is to Map the data and the sizes, and subset via head:
my_list2 <- rep(my_list, 1e5)
sizes2 <- rep(sizes, 1e5)
system.time({Map(head, my_list2, sizes2)})
## user system elapsed
## 2.81 0.19 3.00
Using direct subsetting inside the same Map call is roughly 4x faster:
system.time(Map(\(l,s) if(s == 0) l[0] else l[1:s], my_list2, sizes2))
## user system elapsed
## 0.69 0.00 0.69
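A possible variation, not timed here: indexing with seq_len(s) yields an empty index when s is 0, so the explicit zero check could be dropped. A minimal sketch on the small example data from the question (res_seq is a name used only for illustration):

```r
my_list <- list(c(-4L, -2L), c(4L, 6L, 9L, -4L, 10L, 2L, -3L, 8L), c(-1L, 1L),
                c(-4L, -5L, 5L, -2L, 4L, 10L, 7L),
                c(-2L, 10L, 3L, -3L, 8L, -1L, 7L, 4L, 0L, 2L))
sizes <- c(1, 7, 0, 5, 8)

# seq_len(0) is integer(0), so l[seq_len(0)] is an empty vector of l's type
res_seq <- Map(\(l, s) l[seq_len(s)], my_list, sizes)

# Check against the head() version
identical(res_seq, Map(head, my_list, sizes))
```

Like l[1:s], this would pad with NA if a size were larger than the element's length, so it assumes every size is at most the length of its vector.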
Directly altering the list in place via length<- with a for loop is quicker again:
system.time({
for(i in seq_along(my_list2)) {
length(my_list2[[i]]) <- sizes2[i]
}
})
## user system elapsed
## 0.16 0.02 0.18
The loop returns the same result as the Map option too:
# my_list2 was truncated in place above, so compare against a fresh copy of the input
identical(my_list2, Map(head, rep(my_list, 1e5), sizes2))
##[1] TRUE
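One caveat with length<-: unlike head, it pads with NA when the requested size is larger than the element's current length. If that could occur in the real data, capping the sizes first is one way to guard against it. A small sketch, where x, sizes_big and capped are hypothetical names used only for illustration:

```r
x <- list(c(-4L, -2L), c(-1L, 1L))
sizes_big <- c(1, 5)  # hypothetical: the second size exceeds the vector's length

# Without a guard, length<- extends the vector with NA
y <- x[[2]]
length(y) <- sizes_big[2]
anyNA(y)

# Cap each requested size at the element's actual length
capped <- pmin(sizes_big, lengths(x))

for (i in seq_along(x)) {
  length(x[[i]]) <- capped[i]
}
x
```

With the sizes in the question this guard is unnecessary, since every size is no larger than the length of its vector.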