I have a tibble with a column containing a nested list (<list<list<double>>> data type to be specific).
It looks something like the following (but in R/Arrow format):
ID | nestedvals |
---|---|
001 | [[1]] (1,0.1) [[2]] (2,0.2) [[3]] (3,0.3) [[4]] (4,0.4) [[5]] (5,0.5) |
002 | [[1]] (1,0.1) [[2]] (2,0.2) [[3]] (3,0.3) [[4]] (4,0.4) |
003 | [[1]] (1,0.1) [[2]] (2,0.2) [[3]] (3,0.3) |
004 | [[1]] (1,0.1) [[2]] (2,0.2) |
005 | [[1]] (1,0.1) |
Where if I call row 1 of nestedvals I get:
tibble$nestedvals[1]
<list<list<double>>[1]>
[[1]]
<list<double>[5]>
[[1]]
[1] 1 0.1
[[2]]
[1] 2 0.2
[[3]]
[1] 3 0.3
[[4]]
[1] 4 0.4
[[5]]
[1] 5 0.5
Basically, each entry in the nestedvals column is a list of pairs of doubles: the first number in each pair is an index (e.g. 5) and the second is the value for that index (e.g. 0.5).
What I would like to do is generate a set of zero-filled columns based on the range of unique indexes across the nested lists, e.g. col_1, col_2, col_3, col_4, col_5.
And then replace each 0 with the value (second number in the nested list), based on the index (first number in each nested list), for each row of the tibble.
I believed the best way to do this involves unlisting the variables and making separate columns with the lists of indexes and the lists of values of interest, both so I can find the maximum index for name generation and so I can match each value to its index.
To accomplish this, I wrote a function to split each nested list:
nestsplit <- function(x, y) {
  unlist(lapply(x, `[[`, y))
}
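As a quick check of what this helper returns, here is a minimal sketch that applies it to a single row. The `pairs` object is a hypothetical plain-list stand-in for one entry of nestedvals (it does not carry the Arrow/vctrs classes), so treat it as an illustration rather than a test against the real data.

```r
# nestsplit() as defined above: pull element `y` out of every pair in `x`.
nestsplit <- function(x, y) {
  unlist(lapply(x, `[[`, y))
}

# Hypothetical stand-in for one row of nestedvals (plain list, no Arrow classes).
pairs <- list(c(1, 0.1), c(2, 0.2), c(3, 0.3))

idx  <- nestsplit(pairs, 1)  # first element of each pair: the indexes
vals <- nestsplit(pairs, 2)  # second element of each pair: the values
```

Here `idx` holds the indexes and `vals` holds the matching values in the same order, which is what the index_names and index_values columns rely on.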
And then generate unique columns with the column names (by index) and values of interest to append to the tibble:
tibble <-
  tibble |>
  rowwise() |>
  mutate(
    index_names = list(paste0("col_", as.character(nestsplit(nestedvals, 1)))),
    index_values = list(nestsplit(nestedvals, 2))
  )
But I would like to see if there is an efficient, row-wise, tidyverse/dplyr-based solution to assign the values in index_values to index-based column names using the information in the index_names variable, rather than writing a loop to assign each of them by row.
So that an output like this:
ID | nestedvals | col_1 | col_2 | col_3 | col_4 | col_5 |
---|---|---|---|---|---|---|
001 | <Nested list of 5 pairs of values> | 0 | 0 | 0 | 0 | 0 |
002 | <Nested list of 4 pairs of values> | 0 | 0 | 0 | 0 | 0 |
003 | <Nested list of 3 pairs of values> | 0 | 0 | 0 | 0 | 0 |
004 | <Nested list of 2 pairs of values> | 0 | 0 | 0 | 0 | 0 |
005 | <Nested list of 1 pair of values> | 0 | 0 | 0 | 0 | 0 |
Instead looks like the following:
ID | nestedvals | col_1 | col_2 | col_3 | col_4 | col_5 |
---|---|---|---|---|---|---|
001 | <Nested list of 5 pairs of values> | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
002 | <Nested list of 4 pairs of values> | 0.1 | 0.2 | 0.3 | 0.4 | 0 |
003 | <Nested list of 3 pairs of values> | 0.1 | 0.2 | 0.3 | 0 | 0 |
004 | <Nested list of 2 pairs of values> | 0.1 | 0.2 | 0 | 0 | 0 |
005 | <Nested list of 1 pair of values> | 0.1 | 0 | 0 | 0 | 0 |
To generate the example data shown above, please use:
tibble <-
structure(
list(
ID = c(001, 002, 003, 004, 005),
nestedvals = structure(
list(
structure(
list(c(1, 0.1), c(2, 0.2), c(3, 0.3), c(4, 0.4), c(5, 0.5)),
class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
ptype = numeric(0)
),
structure(
list(c(1, 0.1), c(2, 0.2), c(3, 0.3), c(4, 0.4)),
class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
ptype = numeric(0)
),
structure(
list(c(1, 0.1), c(2, 0.2), c(3, 0.3)),
class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
ptype = numeric(0)
),
structure(
list(c(1, 0.1), c(2, 0.2)),
class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
ptype = numeric(0)
),
structure(
list(c(1, 0.1)),
class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
ptype = numeric(0)
)
),
ptype = structure(
list(),
class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
ptype = numeric(0)
),
class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list")
)
),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")
)
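library(tidyr)  # unnest_longer(), unnest_wider(), pivot_wider(), replace_na()
library(dplyr)  # mutate(), across(), used further below

tibble %>%
  unnest_longer(nestedvals) %>%                  # one row per (index, value) pair
  unnest_wider(nestedvals, names_sep = '_') %>%  # split each pair into nestedvals_1 / nestedvals_2
  pivot_wider(names_from = nestedvals_1, names_prefix = 'col_',
              values_from = nestedvals_2, values_fill = 0)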
# A tibble: 5 × 6
ID col_1 col_2 col_3 col_4 col_5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.1 0.2 0.3 0.4 0.5
2 2 0.1 0.2 0.3 0.4 0
3 3 0.1 0.2 0.3 0 0
4 4 0.1 0.2 0 0 0
5 5 0.1 0 0 0 0
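For comparison, the same fill can be sketched in base R with matrix indexing. This is only an illustration under assumptions: `nested` is a hypothetical plain-list stand-in for the nestedvals column (no Arrow/vctrs classes), the indexes are assumed to be positive whole numbers, and the names `row_id`, `idx`, `val` and `wide` are made up for this example.

```r
# Hypothetical plain-list stand-in for the nestedvals column (no Arrow classes).
nested <- list(
  list(c(1, 0.1), c(2, 0.2), c(3, 0.3)),
  list(c(1, 0.1), c(2, 0.2)),
  list(c(3, 0.3))
)

# Flatten to one (row, index, value) triple per pair.
row_id <- rep(seq_along(nested), lengths(nested))
idx    <- unlist(lapply(nested, function(p) vapply(p, `[`, numeric(1), 1)))
val    <- unlist(lapply(nested, function(p) vapply(p, `[`, numeric(1), 2)))

# Zero-filled matrix with one column per index from 1 up to the maximum seen.
wide <- matrix(0, nrow = length(nested), ncol = max(idx),
               dimnames = list(NULL, paste0("col_", seq_len(max(idx)))))

# Matrix indexing: each (row, column) position receives its value.
wide[cbind(row_id, idx)] <- val
```

One difference from the pivot_wider() approach: this sketch creates a column for every index from 1 to the maximum, whereas pivot_wider() only creates columns for index values that actually occur in the data. The result can be bound back onto the ID column with cbind() or dplyr::bind_cols() if needed.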
Note that if the elements are indexed 1:n, as in your example, you can just use unnest_wider:
unnest_wider(tibble, nestedvals, names_sep = '_', transform = ~.x[2])
# A tibble: 5 × 6
ID nestedvals_1 nestedvals_2 nestedvals_3 nestedvals_4 nestedvals_5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.1 0.2 0.3 0.4 0.5
2 2 0.1 0.2 0.3 0.4 NA
3 3 0.1 0.2 0.3 NA NA
4 4 0.1 0.2 NA NA NA
5 5 0.1 NA NA NA NA
You can easily replace the NA values with 0:
tibble %>%
  unnest_wider(nestedvals, names_sep = '_', transform = ~.x[2]) %>%
  mutate(across(starts_with('nestedvals'), ~replace_na(., 0)))
# A tibble: 5 × 6
ID nestedvals_1 nestedvals_2 nestedvals_3 nestedvals_4 nestedvals_5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.1 0.2 0.3 0.4 0.5
2 2 0.1 0.2 0.3 0.4 0
3 3 0.1 0.2 0.3 0 0
4 4 0.1 0.2 0 0 0
5 5 0.1 0 0 0 0