I am trying to take a jumbled dataframe that does not have unique identifiers and spread the columns by the names in V1 and keep the values from the V2 and in the same order. I have been looking into pivot_wider()
but I have a couple of inherent issues with that function in its simplest form, but I'm not sure how to resolve it.
The original source of this data did not have it organized in an optimal way to easily group.
V1 | V2 | |
---|---|---|
gene | dnaA | |
locus_tag | BABS19_RS0005 | |
old*_*locustag | BABS19_I00010 | |
product | chromosomal replication initiator protein DnaA | |
transl_table | 11 | |
protein_id | ref | |
inference | COORDINATES: similar to AA sequence | |
gene | dnaN | |
locus_tag | BABS19_RS00010 | |
old_locustag | BABS19_I00020 | |
product | DNA polymerase III subunit beta | |
EC_number | 2.7.7.7 | |
transl_table | 11 |
I'd like to convert it to a frame that looks more like this:
gene | locus_tag | old_locustag | product | transl_table | protein_id | inference | EC_number | |
---|---|---|---|---|---|---|---|---|
dnaA | BABS19_RS0005 | BABS19_I00010 | chromosomal replication initiator protein DnaA | 11 | ref | WP_002970905.1 | NA | |
dnaN | BABS19_RS00010 | BABS19_I00020 | DNA polymerase III subunit beta | 11 | NA | NA | 2.7.7.7 |
The tricky part is some information is missing that I don't know how to add to the table. Basically some of the groups have data missing (For example: some have an EC_number and others do not), and it would be great to insert an NA instead of not including the variable, which would result in incorrect placement of variables in the rows, but that is not how the data is presented in its current form. I may need to reshape it but also do not know how to do that.
If I call the original frame "data", my code is:
wide <- data %>% pivot_wider(names_from = V1, values_from = V2)
and the output is:
Warning message:
Values from `V2` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(V1) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
What I'm getting is list-col outputs like:
c("dnaA", "dnaN", "recF",...)
c("BABS19_RS00005", "BABS19_RS00010", "BABS19_RS00015", ...)
c("BAbS19_I00010", "BAbS19_I00020", "BAbS19_I00030", ...)
c("chromosomal replication initiator protein DnaA", "DNA polymerase III subunit beta", "DNA replication/repair protein RecF", ...) c("ref|WP_002970905.1|", "ref|WP_002974246.1|", "ref|WP_002965250.1|", ...)
c("COORDINATES: similar to AA sequence:RefSeq:WP_009363261.1", "COORDINATES: similar to AA sequence:RefSeq:WP_007872890.1", ...)
Putting unique rownames into the dataframe doesn't help because then every V2 gets it's own row and the rest of the columns in the row have "NA".
I also looked into unnesting the list-cols [unnest_longer] but the different sizes of the lists causes a problem as well.
You need a field to distinguish which gene each row belongs to, which in this case is implied positionally by the number of times we've encountered "gene" in V1
.
library(tidyverse)
data |>
mutate(row = cumsum(V1 == "gene")) |>
pivot_wider(names_from = V1, values_from = V2)
Result
# A tibble: 2 × 9
row gene locus_tag old_locustag product transl_table protein_id inference EC_number
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 dnaA BABS19_RS0005 BABS19_I00010 chromosomal replication initiator protein DnaA 11 ref COORDINATES: similar to AA sequen… NA
2 2 dnaN BABS19_RS00010 BABS19_I00020 DNA polymerase III subunit beta 11 NA NA 2.7.7.7