How to unnest a data frame containing list of list with varied length?


I was trying to unnest the following data frame:

df.org <- structure(list(Gene = "ARIH1", Description = "E3 ubiquitin-protein ligase ARIH1", 
    condition2_cellline = list(c("MCF7", "Jurkat")), condition2_activity = list(
        c(40.8284023668639, 13.26973)), condition2_concentration = list(
        c("100uM", "100uM")), condition3_cellline = list("Jurkat"), 
    condition3_activity = list(-4.60251), condition3_concentration = list(
        "100uM")), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame"))

This is my code:

df.output <- df.org %>% 
  unnest(where(is.list), keep_empty = TRUE)

This is what I got:

structure(list(Gene = c("ARIH1", "ARIH1"), Description = c("E3 ubiquitin-protein ligase ARIH1", 
"E3 ubiquitin-protein ligase ARIH1"), condition2_cellline = c("MCF7", 
"Jurkat"), condition2_activity = c(40.8284023668639, 13.26973
), condition2_concentration = c("100uM", "100uM"), condition3_cellline = c("Jurkat", 
"Jurkat"), condition3_activity = c(-4.60251, -4.60251), condition3_concentration = c("100uM", 
"100uM")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-2L))

Is there a way to avoid recycling the values of the shorter list columns, so that they are filled with NA instead? This is the output I want to get:

df.desired <- structure(list(Gene = c("ARIH1", "ARIH1"), Description = c("E3 ubiquitin-protein ligase ARIH1", 
"E3 ubiquitin-protein ligase ARIH1"), condition2_cellline = c("MCF7", 
"Jurkat"), condition2_activity = c(40.8284023668639, 13.26973
), condition2_concentration = c("100uM", "100uM"), condition3_cellline = c(NA, 
"Jurkat"), condition3_activity = c(NA, -4.60251), condition3_concentration = c(NA, 
"100uM")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-2L))

Thanks so much for any help!


Solution

  • This can be done without reshaping: take the maximum of the list-element lengths across the list columns, loop over those columns to pad each element with NA up to that length (using `length<-`), and then call unnest.

    library(dplyr)
    library(purrr)
    library(tidyr)
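    df.org %>% 
      mutate(
        # longest list element across all list columns
        l1 = max(across(where(is.list), lengths)),
        # pad every list element with NA up to that length
        across(where(is.list), ~ map(.x, `length<-`, l1)),
        # drop the helper column again
        l1 = NULL
      ) %>% 
      unnest(where(is.list), keep_empty = TRUE)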
    

    Output:

    # A tibble: 2 × 8
      Gene  Description                       condition2_cellline condition2_activity condition2_concentration condition3_cellline condition3_activity condition3_concentration
      <chr> <chr>                             <chr>                             <dbl> <chr>                    <chr>                             <dbl> <chr>                   
    1 ARIH1 E3 ubiquitin-protein ligase ARIH1 MCF7                               40.8 100uM                    Jurkat                            -4.60 100uM                   
    2 ARIH1 E3 ubiquitin-protein ligase ARIH1 Jurkat                             13.3 100uM                    <NA>                              NA    <NA>
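
The answer above uses one maximum length for the whole data frame, which is fine for the single-row example. With several rows it may be preferable to pad each row only up to that row's own longest element. The following is a minimal sketch of that idea, not part of the original answer: `df.demo`, `list_cols`, `n_max` and `pad_to` are hypothetical names introduced for illustration, and NULL list elements are not handled.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row data (not from the question) so that the
# per-row padding is visible.
df.demo <- tibble(
  Gene = c("A", "B"),
  x = list(c("p", "q"), "r"),
  y = list(1, c(2, 3, 4))
)

# Names of the list columns
list_cols <- names(df.demo)[vapply(df.demo, is.list, logical(1))]

# For each row, the longest element found in any list column
n_max <- do.call(pmax, unname(lapply(df.demo[list_cols], lengths)))

# Pad one vector with trailing NA up to length n
pad_to <- function(el, n) {
  length(el) <- n
  el
}

# Pad every element of every list column to its row's maximum
df.padded <- df.demo
df.padded[list_cols] <- lapply(
  df.demo[list_cols],
  function(col) Map(pad_to, col, n_max)
)

# All list elements within a row now have equal length,
# so unnest() has nothing left to recycle
df.out <- unnest(df.padded, all_of(list_cols))
df.out
```

As in the answer, the padding is appended at the end of each vector, so for the question's data the condition3 values end up in the first row rather than the second row shown in df.desired. If the NA values should come first, one option would be to prepend them in `pad_to`, for example with `c(rep(NA, n - length(el)), el)`.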