Search code examples
rdplyrtidyrnested-lists

Tidyverse/Dplyr solution to assigning values to column names extracted from a nested list


I have a tibble with a column containing a nested list (<list<list<double>>> data type to be specific).

It looks something like the following (but in R/Arrow format):

ID nestedvals
001 [[1]] (1,0.1) [[2]] (2,0.2) [[3]] (3,0.3) [[4]] (4,0.4) [[5]] (5,0.5)
002 [[1]] (1,0.1) [[2]] (2,0.2) [[3]] (3,0.3) [[4]] (4,0.4)
003 [[1]] (1,0.1) [[2]] (2,0.2) [[3]] (3,0.3)
004 [[1]] (1,0.1) [[2]] (2,0.2)
005 [[1]] (1,0.1)

Where if I call row 1 of nestedvals I get:

tibble$nestedvals[1]
<list<list<double>>[1]>
[[1]]
<list<double>[5]>
[[1]]
[1] 1   0.1

[[2]]
[1] 2   0.2

[[3]]
[1] 3   0.3

[[4]]
[1] 4   0.4

[[5]]
[1] 5   0.5

Basically, for the nestedvals column, there is a list containing a list of pairs of doubles, with the first indicating a specific index (e.g. 5), and a certain value (e.g. 0.5).

What I would like to do is generate set of zero-filled columns based on the range of unique indexes of each nested list. e.g.: col_1, col_2, col_3, col_4, col_5

And then replace each 0 with the value (second number in the nested list), based on the index (first number in each nested list), for each row of the tibble.

I believed the best way to do this involves unlisting the variables and making separate columns with the lists of indexes and lists of values of interest, both so I can find the maximum for name generation in the former and for assignment between to two.

To accomplish this, I wrote a function to split each nested list:

  nestsplit <- function(x, y) {
    unlist(lapply(x, `[[`, y))
  }

And then generate unique columns with the column names (by index) and values of interest to append to the tibble:

  tibble <-
    tibble |> rowwise() |> mutate(index_names = list(paste0(
      "col_", as.character(nestsplit(nestedvals, 1))
    )),
    index_values = list(nestsplit(nestedvals, 2)))

But I would like to see if there is an efficient, row-wise, tidyverse/dplyr-based solution to assign the values in index_values to index-based column names using the information in the index_names variable, rather than writing a loop to assign each of them by row.

So that an output like this:

ID nestedvals col_1 col_2 col_3 col_4 col_5
001 <Nested list of 5 pairs of values> 0 0 0 0 0
002 <Nested list of 4 pairs of values> 0 0 0 0 0
003 <Nested list of 3 pairs of values> 0 0 0 0 0
004 <Nested list of 2 pairs of values> 0 0 0 0 0
005 <Nested list of 1 pair of values> 0 0 0 0 0

Instead looks like the following:

ID nestedvals col_1 col_2 col_3 col_4 col_5
001 <Nested list of 5 pairs of values> 0.1 0.2 0.3 0.4 0.5
002 <Nested list of 4 pairs of values> 0.1 0.2 0.3 0.4 0
003 <Nested list of 3 pairs of values> 0.1 0.2 0.3 0 0
004 <Nested list of 2 pairs of values> 0.1 0.2 0 0 0
005 <Nested list of 1 pair of values> 0.1 0 0 0 0

To generate some example data for above, please use:

tibble <-
  structure(
    list(
      ID = c(001, 002, 003, 004, 005),
      nestedvals = structure(
        list(
          structure(
            list(c(1, 0.1), c(2, 0.2), c(3, 0.3), c(4, 0.4), c(5, 0.5)),
            class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
            ptype = numeric(0)
          ),
          structure(
            list(c(1, 0.1), c(2, 0.2), c(3, 0.3), c(4, 0.4)),
            class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
            ptype = numeric(0)
          ),
          structure(
            list(c(1, 0.1), c(2, 0.2), c(3, 0.3)),
            class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
            ptype = numeric(0)
          ),
          structure(
            list(c(1, 0.1), c(2, 0.2)),
            class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
            ptype = numeric(0)
          ),
          structure(
            list(c(1, 0.1)),
            class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
            ptype = numeric(0)
          )
        ),
        ptype = structure(
          list(),
          class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list"),
          ptype = numeric(0)
        ),
        class = c("arrow_list", "vctrs_list_of", "vctrs_vctr", "list")
      )
    ),
    row.names = c(NA, -5L),
    class = c("tbl_df", "tbl", "data.frame")
  )

Solution

  • tibble%>%
       unnest_longer(nestedvals) %>%
       unnest_wider(nestedvals, names_sep = '_') %>%
       pivot_wider(names_from = nestedvals_1, names_prefix = 'col_',
                   values_from = nestedvals_2, values_fill = 0)
    
    # A tibble: 5 × 6
         ID col_1 col_2 col_3 col_4 col_5
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1     1   0.1   0.2   0.3   0.4   0.5
    2     2   0.1   0.2   0.3   0.4   0  
    3     3   0.1   0.2   0.3   0     0  
    4     4   0.1   0.2   0     0     0  
    5     5   0.1   0     0     0     0  
    

    Note that in the case the elements are indexed 1:n,like in your example just use unnest_wider:

    unnest_wider(tibble, nestedvals, names_sep = '_', transform = ~.x[2])
    
    # A tibble: 5 × 6
         ID nestedvals_1 nestedvals_2 nestedvals_3 nestedvals_4 nestedvals_5
      <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
    1     1          0.1          0.2          0.3          0.4          0.5
    2     2          0.1          0.2          0.3          0.4         NA  
    3     3          0.1          0.2          0.3         NA           NA  
    4     4          0.1          0.2         NA           NA           NA  
    5     5          0.1         NA           NA           NA           NA  
    

    You can easily replace the NA with 0:

    tibble %>%
      unnest_wider(nestedvals, names_sep = '_', transform = ~.x[2]) %>%
      mutate(across(starts_with('nestedvals'), ~replace_na(., 0)))
    
    # A tibble: 5 × 6
         ID nestedvals_1 nestedvals_2 nestedvals_3 nestedvals_4 nestedvals_5
      <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
    1     1          0.1          0.2          0.3          0.4          0.5
    2     2          0.1          0.2          0.3          0.4          0  
    3     3          0.1          0.2          0.3          0            0  
    4     4          0.1          0.2          0            0            0  
    5     5          0.1          0            0            0            0