Tags: r, loops, vectorization, nested-lists

Vectorize replacement of sub-sub list elements in R


I have a somewhat complicated data structure (a nested list) y, defined as:

x <- list(
  list(1, "a", 2, "b", 0.1),
  list(3, "c", 4, "d", 0.2),
  list(5, "e", 6, "f", 0.3)
)
y <- rep(list(x), 10)

I also have a data frame df, defined as:

df <- data.frame(
  x1 = c( 0.33, 1.67, -0.62, -0.56,  0.17, 0.73,  0.59,  0.56, -0.22, 1.49),
  x2 = c(-0.82, 1.22,  0.65,  0.54, -2.26, 1.21, -0.44, -0.92, -0.56, 0.50),
  x3 = c(-0.16, 0.49, -0.82, -0.71,  0.13, 1.22,  1.23, -0.01, -1.11, 0.97)
)

where the column names are not important.

I want to replace y[[i]][[j]][[5]] with df[[i, j]] for all i and j. My Python/Julia brain works best in loops, so I have accomplished this by looping over y, and then over the elements of y (each a copy of x), like so:

for (i in seq_along(y)) {
  for (j in 1:3) {
    y[[i]][[j]][[5]] <- df[[i, j]]
  }
}

This works, but for my much larger data set, it is really slow. So I am trying to vectorize the nested for loop. I have been experimenting with Map():

y_new <- y
for (j in 1:3) {
  y_new <- Map(function(sublist, value) { sublist[[5]] <- value; sublist },
               y_new, df[, j])
}

but the above does not work: identical(y, y_new) returns FALSE, where y holds the result of the nested loop above. I think I am missing a level of subsetting.
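
For concreteness, I believe the missing level is the inner one: each element of y_new is itself a list of three sublists, so the assignment has to happen one layer down. A nested Map() along these lines seems to produce the right result, though I have not timed it:

y_new <- Map(
  function(sublist, values) Map(function(s, v) { s[[5]] <- v; s },
                                sublist, values),
  y, asplit(df, 1L)
)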

I am not married to Map() at all. I am just looking for the fastest possible alternative to the nested for loop.


Solution

  • @akrun's suggestion to unlist and relist is elegant and highly idiomatic ("R-like").
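
    For reference, that approach (akrun.2 in the benchmark below) flattens y into a character vector, overwrites every fifth element, and rebuilds the original shape with relist():

    uy <- unlist(y)
    uy[seq.int(5L, length(uy), 5L)] <- c(t(df))
    type.convert(relist(uy, y), as.is = TRUE)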

    But I would want to do that without coercions, especially from numeric to character and back, which can be slow and result in loss of precision. Something like this would be faster and safer:

    unlist0 <- function(x) unlist(x, recursive = FALSE, use.names = FALSE)
    split0 <- function(x, f) unname(split(x, f))
    
    n <- length(y) # 10
    n1 <- length(y[[1L]]) # 3
    n11 <- length(y[[1L]][[1L]]) # 5
    
    uy <- unlist0(unlist0(y))
    uy[seq.int(n11, n * n1 * n11, n11)] <- as.list(t(df))
    suy <- split0(split0(uy, gl(n * n1, n11)), gl(n, n1))
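
    As a quick check, this reproduces the question's nested loop exactly:

    y_loop <- y
    for (i in seq_len(n))
        for (j in seq_len(n1))
            y_loop[[i]][[j]][[n11]] <- df[[i, j]]
    stopifnot(identical(suy, y_loop))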
    

    Here's a benchmark:

    unlist0 <- function(x) unlist(x, recursive = FALSE, use.names = FALSE)
    split0 <- function(x, f) unname(split(x, f))
    
    n <- length(y) # 10
    n1 <- length(y[[1L]]) # 3
    n11 <- length(y[[1L]][[1L]]) # 5
    
    library(purrr)
    
    microbenchmark::microbenchmark(
    colebrookson =
        {
            ans <- y
            for (i in seq_len(n))
                for (j in seq_len(n1))
                    ans[[i]][[j]][[n11]] <- df[[i, j]]
            ans
        },
    TarJae =
        {
            map2(y, asplit(df, 1L), ~ map2(.x, .y, ~ { .x[[n11]] <- .y; .x }))
        },
    akrun.1 = 
        {
            Map(function(u, v) Map(function(uu, vv) { uu[5L] <- vv; uu }, u, v), y, asplit(df, 1L))
        },
    akrun.2 = 
        {
            uy <- unlist(y)
            uy[seq.int(n11, n * n1 * n11, n11)] <- c(t(df))
            type.convert(relist(uy, y), as.is = TRUE)
        },
    `Mikael Jagan` =
        {
            uy <- unlist0(unlist0(y))
            uy[seq.int(n11, n * n1 * n11, n11)] <- as.list(t(df))
            split0(split0(uy, gl(n * n1, n11)), gl(n, n1))
        },
    times = 1000L
    )
    
    Unit: microseconds
             expr      min        lq       mean    median       uq       max neval
     colebrookson 1116.635 1171.6365 1318.72441 1195.2115 1238.077 16936.567  1000
           TarJae  297.783  314.8390  365.66412  331.7105  352.026  1554.679  1000
          akrun.1   76.096   82.4305   96.82716   87.0840   91.676  2076.117  1000
          akrun.2 1206.343 1231.1685 1345.24661 1244.2270 1261.222  5023.197  1000
     Mikael Jagan   35.465   40.9590   51.61260   45.7765   50.594  1271.984  1000
    

    Some remarks:

    • Don't read too much into benchmarks on small examples.
    • I suspect that only the fourth and fifth answers (akrun.2 and Mikael Jagan) will scale well, since only those vectorize the replacement.
    • Only the fourth answer (akrun.2) gives a result that is not identical to the first, in the sense of identical(); it loses precision in the round trip through character, as illustrated below.
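
    A minimal, standalone illustration of that precision loss (the values here are hypothetical, not taken from the example data): as.character() keeps only about 15 significant digits of a double, so the round trip through character is not exact.

    v <- 0.1 + 0.2
    as.character(v)                            # "0.3": low-order bits are dropped
    identical(v, as.numeric(as.character(v)))  # FALSE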