I have a data.frame with several columns, each of a different class. For instance:
Column 1, "id", is a list with 7803 individual elements.
Column 2, "location", is a character vector (7803 rows, one string per row).
Column 3, "alleles", is a list with 7803 individual elements.
Column 4, "clinical_significance", is a list of lists with 7803 elements, each of which may contain zero to three elements.
Here is a small subset, produced with dput():
structure(list(id = list("rs1585931494", "rs1253996056", "rs368528867",
"rs397507487", "rs1291775716", "rs1205853831", "rs555976452",
"rs727502904", "rs1481562268"), location = c("1:140734725-140734725",
"1:140734735-140734735", "1:140734742-140734742", "1:140734743-140734743",
"1:140734752-140734752", "1:140734755-140734755", "1:140734758-140734758",
"1:140734763-140734763", "1:140734764-140734764"), alleles = list(
structure(c("G", "A"), .Dim = 2:1), structure(c("C", "A"), .Dim = 2:1),
structure(c("C", "A", "T"), .Dim = c(3L, 1L)), structure(c("G",
"A"), .Dim = 2:1), structure(c("G", "C"), .Dim = 2:1), structure(c("C",
"A"), .Dim = 2:1), structure(c("T", "A", "C"), .Dim = c(3L,
1L)), structure(c("G", "A", "T"), .Dim = c(3L, 1L)), structure(c("C",
"A", "T"), .Dim = c(3L, 1L))), clinical_significance = list(
list(), list(), structure("uncertain significance", .Dim = c(1L,
1L)), list(), list(), list(), list(), structure(c("uncertain significance",
"likely pathogenic"), .Dim = 2:1), structure("likely pathogenic", .Dim = c(1L,
1L))), consequence_type = list("missense_variant", "missense_variant",
"missense_variant", "missense_variant", "missense_variant",
"stop_gained", "missense_variant", "missense_variant", "missense_variant"),
gene_symbol = c("ENSG00000139618", "ENSG00000139618", "ENSG00000139618",
"ENSG00000139618", "ENSG00000139618", "ENSG00000139618",
"ENSG00000139618", "ENSG00000139618", "ENSG00000139618")), row.names = c(3544L,
3545L, 3547L, 3548L, 3550L, 3552L, 3554L, 3556L, 3557L), class = "data.frame")
I want a simple data.frame with a single character value per [row, column] cell. I am having particular trouble unlisting the clinical_significance list of lists. Since a cell may contain several elements, I just want to collapse them into a single string, separated by commas, but I have not been able to get anywhere close to that.
I have tried the following solutions:
do.call(rbind.data.frame, my_df)
Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors(), :
invalid list argument: all variables should have the same length
# This apparently works, but writing the result out as a table fails
df <- dplyr::bind_rows(my_df) # or: df <- purrr::map_df(my_df, dplyr::bind_rows)
write.table(df)
Error in write.table(df) : unimplemented type 'list' in 'EncodeElement'
I appreciate any feedback or suggestions.
Apologies if I have misunderstood what you need, but try this tidyverse solution, which collapses the contents of every cell into a single comma-separated string:
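library(dplyr)   # group_by(), summarise(), across()
library(purrr)   # map_chr()
library(tibble)  # as_tibble(), rownames_to_column()
# Keep the row names as a key, then paste each cell's contents together
dat <- df |>
  as_tibble(rownames = NA) |>
  rownames_to_column() |>
  group_by(rowname) |>
  summarise(across(id:gene_symbol, ~ map_chr(., ~ paste(., collapse = ","))))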
Output for the sample data you provided:
> dat
# A tibble: 10 x 7
rowname id location alleles clinical_significance consequence_type gene_symbol
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 478 rs866323699 1:140721551-140721551 G,A,C "" splice_acceptor_variant ENSG00000139618
2 479 rs1365858617 1:140721572-140721572 G,A "" missense_variant ENSG00000139618
3 481 rs955654903 1:140721574-140721574 T,C "" missense_variant ENSG00000139618
4 482 rs1291598718 1:140721575-140721575 A,AA "" stop_gained ENSG00000139618
5 484 rs35895841 1:140721578-140721578 C,A "" missense_variant ENSG00000139618
6 485 rs1389663088 1:140721586-140721586 T,C "" missense_variant ENSG00000139618
7 487 rs772872980 1:140721589-140721589 G,A "" missense_variant ENSG00000139618
8 489 rs1239580966 1:140721598-140721598 T,C "" missense_variant ENSG00000139618
9 490 rs1315761595 1:140721599-140721599 G,A "" stop_gained ENSG00000139618
10 491 rs1470673381 1:140721606-140721606 C,G "" missense_variant ENSG00000139618
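If you would rather avoid the tidyverse, the same idea can be sketched in base R. This is only a minimal sketch: `flatten_list_cols` is a hypothetical helper name of my own, and the three-row `toy` data.frame below is rebuilt by hand from rows of the dput() sample in the question, so check it against your full data before relying on it. The helper walks over the columns, and for every list column pastes the contents of each cell into one string; empty cells such as list() become "".

```r
# Hypothetical helper (not from any package): collapse every list-column cell
# into a single string, leaving non-list columns untouched.
flatten_list_cols <- function(d, sep = ",") {
  for (nm in names(d)) {
    if (is.list(d[[nm]])) {
      d[[nm]] <- vapply(
        d[[nm]],
        function(cell) paste(unlist(cell), collapse = sep),
        character(1)
      )
    }
  }
  d
}

# Three rows rebuilt from the dput() sample in the question
toy <- data.frame(
  location = c("1:140734725-140734725", "1:140734742-140734742",
               "1:140734763-140734763")
)
toy$id <- list("rs1585931494", "rs368528867", "rs727502904")
toy$alleles <- list(
  matrix(c("G", "A"), ncol = 1),
  matrix(c("C", "A", "T"), ncol = 1),
  matrix(c("G", "A", "T"), ncol = 1)
)
toy$clinical_significance <- list(
  list(),
  matrix("uncertain significance", nrow = 1, ncol = 1),
  matrix(c("uncertain significance", "likely pathogenic"), ncol = 1)
)

flat <- flatten_list_cols(toy)
str(flat)

# With no list columns left, write.table() no longer hits 'EncodeElement'
write.table(flat, sep = "\t", quote = FALSE)
```

Because no column of the result is a list any more, write.table() should accept it. If you prefer a different separator, pass it through the `sep` argument of the helper, e.g. flatten_list_cols(toy, sep = ";").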