Tags: r, dplyr, tidyverse

Transform, flatten, or unlist a data frame with several column types in R


I have a data.frame with several columns, each of a different class. For instance:

Column 1: "id" is a list with 7803 individual elements.

Column 2: "location" is a character vector (7803 rows, each a single string).

Column 3: "alleles" is a list with 7803 individual elements.

Column 4: "clinical_significance" is a list of lists with 7803 elements, each of which may contain one to three elements.

Here is an example of how it looks:

[Image of the data.frame]

and here is a small subset with dput():

structure(list(id = list("rs1585931494", "rs1253996056", "rs368528867", 
    "rs397507487", "rs1291775716", "rs1205853831", "rs555976452", 
    "rs727502904", "rs1481562268"), location = c("1:140734725-140734725", 
"1:140734735-140734735", "1:140734742-140734742", "1:140734743-140734743", 
"1:140734752-140734752", "1:140734755-140734755", "1:140734758-140734758", 
"1:140734763-140734763", "1:140734764-140734764"), alleles = list(
    structure(c("G", "A"), .Dim = 2:1), structure(c("C", "A"), .Dim = 2:1), 
    structure(c("C", "A", "T"), .Dim = c(3L, 1L)), structure(c("G", 
    "A"), .Dim = 2:1), structure(c("G", "C"), .Dim = 2:1), structure(c("C", 
    "A"), .Dim = 2:1), structure(c("T", "A", "C"), .Dim = c(3L, 
    1L)), structure(c("G", "A", "T"), .Dim = c(3L, 1L)), structure(c("C", 
    "A", "T"), .Dim = c(3L, 1L))), clinical_significance = list(
    list(), list(), structure("uncertain significance", .Dim = c(1L, 
    1L)), list(), list(), list(), list(), structure(c("uncertain significance", 
    "likely pathogenic"), .Dim = 2:1), structure("likely pathogenic", .Dim = c(1L, 
    1L))), consequence_type = list("missense_variant", "missense_variant", 
    "missense_variant", "missense_variant", "missense_variant", 
    "stop_gained", "missense_variant", "missense_variant", "missense_variant"), 
    gene_symbol = c("ENSG00000139618", "ENSG00000139618", "ENSG00000139618", 
    "ENSG00000139618", "ENSG00000139618", "ENSG00000139618", 
    "ENSG00000139618", "ENSG00000139618", "ENSG00000139618")), row.names = c(3544L, 
3545L, 3547L, 3548L, 3550L, 3552L, 3554L, 3556L, 3557L), class = "data.frame")

I want a simple data.frame with a single character value per [row, column] cell. I am having particular trouble unlisting the clinical_significance list of lists. Because an entry may contain several elements, I want to collapse them into a single string, separated by commas, but I cannot get anywhere close to that.

I have tried the following solutions:

do.call(rbind.data.frame, my_df)

Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors(),  : 
  invalid list argument: all variables should have the same length


# This apparently works, but writing the result out as a table fails
df <- dplyr::bind_rows(my_df) # or: df <- purrr::map_df(my_df, dplyr::bind_rows)
write.table(df)

Error in write.table(df) : unimplemented type 'list' in 'EncodeElement'
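
The 'EncodeElement' error indicates that at least one column is still a list when it reaches write.table(). As a minimal sketch of how to check this, the snippet below builds a small made-up data frame (toy_df and is_list_col are hypothetical names, not part of my real data) and flags the columns that would need flattening first:

```r
# Hypothetical toy data frame: one atomic column and one list column
toy_df <- data.frame(location = c("1:100-100", "1:200-200"))
toy_df$alleles <- list(c("G", "A"), c("C", "A", "T"))

# TRUE marks the columns that are lists and therefore cannot be
# written by write.table() without being collapsed first
is_list_col <- vapply(toy_df, is.list, logical(1))
print(is_list_col)
```

Running the same vapply() call on my_df should show which columns still have to be converted to plain character vectors.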

I appreciate any feedback or suggestions.


Solution

  • Apologies if I misunderstood what you needed, but try this tidyverse solution:

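    library(tidyverse)

    # Collapse every cell to one comma-separated string (the |> pipe needs R >= 4.1)
    dat <- my_df |> 
      as_tibble(rownames=NA) |>   # keep the original row names
      rownames_to_column() |>     # move them into a "rowname" column
      group_by(rowname) |>        # one group per original row
      summarise(across(id:gene_symbol, ~map_chr(., ~paste(., collapse=","))))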
    

    Output of the sample data you provided:

    > dat
    # A tibble: 10 x 7
       rowname id           location              alleles clinical_significance consequence_type        gene_symbol    
       <chr>   <chr>        <chr>                 <chr>   <chr>                 <chr>                   <chr>          
     1 478     rs866323699  1:140721551-140721551 G,A,C   ""                    splice_acceptor_variant ENSG00000139618
     2 479     rs1365858617 1:140721572-140721572 G,A     ""                    missense_variant        ENSG00000139618
     3 481     rs955654903  1:140721574-140721574 T,C     ""                    missense_variant        ENSG00000139618
     4 482     rs1291598718 1:140721575-140721575 A,AA    ""                    stop_gained             ENSG00000139618
     5 484     rs35895841   1:140721578-140721578 C,A     ""                    missense_variant        ENSG00000139618
     6 485     rs1389663088 1:140721586-140721586 T,C     ""                    missense_variant        ENSG00000139618
     7 487     rs772872980  1:140721589-140721589 G,A     ""                    missense_variant        ENSG00000139618
     8 489     rs1239580966 1:140721598-140721598 T,C     ""                    missense_variant        ENSG00000139618
     9 490     rs1315761595 1:140721599-140721599 G,A     ""                    stop_gained             ENSG00000139618
    10 491     rs1470673381 1:140721606-140721606 C,G     ""                    missense_variant        ENSG00000139618
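
If the row names are not needed, the same collapsing can probably be done with mutate() alone, which skips the grouping step and keeps the original row order. The following is only a sketch under assumptions, not the code used to produce the output above: toy_df and flat_df are hypothetical names, the data is made up to mimic the column types in the question, and dplyr and purrr are assumed to be installed.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data mimicking the question's column types
toy_df <- data.frame(location = c("1:100-100", "1:200-200", "1:300-300"))
toy_df$id <- list("rs1", "rs2", "rs3")
toy_df$clinical_significance <- list(
  list(),
  matrix(c("uncertain significance", "likely pathogenic"), ncol = 1),
  matrix("likely pathogenic", ncol = 1)
)

# Collapse each cell to a single comma-separated string;
# empty lists collapse to the empty string ""
collapse_cell <- function(cell) paste(cell, collapse = ",")

flat_df <- toy_df |>
  mutate(across(everything(), function(col) map_chr(col, collapse_cell)))

# Every column is now a plain character vector, so write.table() can encode it
write.table(flat_df, sep = "\t", quote = FALSE)
```

Empty clinical_significance entries end up as empty strings here; replacing them with NA afterwards (for example with dplyr::na_if()) is a separate choice that depends on how the table will be used.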