I have a somewhat complicated data structure (a nested list) y
, defined as:
x <- list(
list(1, "a", 2, "b", 0.1),
list(3, "c", 4, "d", 0.2),
list(5, "e", 6, "f", 0.3)
)
y <- rep(list(x), 10)
I also have a data frame df
, defined as:
df <- data.frame(
x1 = c( 0.33, 1.67, -0.62, -0.56, 0.17, 0.73, 0.59, 0.56, -0.22, 1.49),
x2 = c(-0.82, 1.22, 0.65, 0.54, -2.26, 1.21, -0.44, -0.92, -0.56, 0.50),
x3 = c(-0.16, 0.49, -0.82, -0.71, 0.13, 1.22, 1.23, -0.01, -1.11, 0.97)
)
where the column names are not important.
I want to replace y[[i]][[j]][[5]]
with df[[i, j]]
for all i
and j
. My Python/Julia brain works best in loops, so I have accomplished this by looping over y
, and then over the elements of y
(each a copy of x
), like so:
for (i in seq_along(y)) {
for (j in 1:3) {
y[[i]][[j]][[5]] <- df[[i, j]]
}
}
This works, but for my much larger data set, it is really slow. So I am trying to vectorize the nested for
loop. I have been experimenting with Map()
:
y_new <- y
for (j in 1:3) {
y_new <- Map(function(sublist, value) { sublist[[5]] <- value; sublist },
y_new, df[, j])
}
but the above does not work, since identical(y, y_new)
returns FALSE
. I think I am missing a level of subsetting.
I am not married to Map()
at all. I am just looking for the fastest possible alternative to the nested for
loop.
@akrun's suggestion to unlist and relist is elegant and highly idiomatic ("R-like").
But I would want to do that without coercions, especially from numeric to character and back, which can be slow and result in loss of precision. Something like this would be faster and safer:
unlist0 <- function(x) unlist(x, recursive = FALSE, use.names = FALSE)
split0 <- function(x, f) unname(split(x, f))
n <- length(y) # 10
n1 <- length(y[[1L]]) # 3
n11 <- length(y[[1L]][[1L]]) # 5
uy <- unlist0(unlist0(y))
uy[seq.int(n11, n * n1 * n11, n11)] <- as.list(t(df))
suy <- split0(split0(uy, gl(n * n1, n11)), gl(n, n1))
Here's a benchmark:
unlist0 <- function(x) unlist(x, recursive = FALSE, use.names = FALSE)
split0 <- function(x, f) unname(split(x, f))
n <- length(y) # 10
n1 <- length(y[[1L]]) # 3
n11 <- length(y[[1L]][[1L]]) # 5
library(purrr)
microbenchmark::microbenchmark(
colebrookson =
{
ans <- y
for (i in seq_len(n))
for (j in seq_len(n1))
ans[[i]][[j]][[n11]] <- df[[i, j]]
ans
},
TarJae =
{
map2(y, asplit(df, 1L), ~ map2(.x, .y, ~ { .x[[n11]] <- .y; .x }))
},
akrun.1 =
{
Map(function(u, v) Map(function(uu, vv) { uu[5L] <- vv; uu }, u, v), y, asplit(df, 1L))
},
akrun.2 =
{
uy <- unlist(y)
uy[seq.int(n11, n * n1 * n11, n11)] <- c(t(df))
type.convert(relist(uy, y), as.is = TRUE)
},
`Mikael Jagan` =
{
uy <- unlist0(unlist0(y))
uy[seq.int(n11, n * n1 * n11, n11)] <- as.list(t(df))
split0(split0(uy, gl(n * n1, n11)), gl(n, n1))
},
times = 1000L
)
Unit: microseconds
expr min lq mean median uq max neval
colebrookson 1116.635 1171.6365 1318.72441 1195.2115 1238.077 16936.567 1000
TarJae 297.783 314.8390 365.66412 331.7105 352.026 1554.679 1000
akrun.1 76.096 82.4305 96.82716 87.0840 91.676 2076.117 1000
akrun.2 1206.343 1231.1685 1345.24661 1244.2270 1261.222 5023.197 1000
Mikael Jagan 35.465 40.9590 51.61260 45.7765 50.594 1271.984 1000
Some remarks:
identical
. It differs due to loss of precision.