Search code examples
rdataframedplyrtidyrdata-wrangling

Reshape dataframe with unequal occurrences of character variables to pivot_wider without removing duplicate names or nesting character variables


I am trying to take a jumbled dataframe that does not have unique identifiers and spread the columns by the names in V1 and keep the values from the V2 and in the same order. I have been looking into pivot_wider() but I have a couple of inherent issues with that function in its simplest form, but I'm not sure how to resolve it.

The original source of this data did not have it organized in an optimal way to easily group.

V1 V2
gene dnaA
locus_tag BABS19_RS0005
old*_*locustag BABS19_I00010
product chromosomal replication initiator protein DnaA
transl_table 11
protein_id ref
inference COORDINATES: similar to AA sequence
gene dnaN
locus_tag BABS19_RS00010
old_locustag BABS19_I00020
product DNA polymerase III subunit beta
EC_number 2.7.7.7
transl_table 11

I'd like to convert it to a frame that looks more like this:

gene locus_tag old_locustag product transl_table protein_id inference EC_number
dnaA BABS19_RS0005 BABS19_I00010 chromosomal replication initiator protein DnaA 11 ref WP_002970905.1 NA
dnaN BABS19_RS00010 BABS19_I00020 DNA polymerase III subunit beta 11 NA NA 2.7.7.7

The tricky part is some information is missing that I don't know how to add to the table. Basically some of the groups have data missing (For example: some have an EC_number and others do not), and it would be great to insert an NA instead of not including the variable, which would result in incorrect placement of variables in the rows, but that is not how the data is presented in its current form. I may need to reshape it but also do not know how to do that.

If I call the original frame "data", my code is:

wide <- data %>% pivot_wider(names_from = V1, values_from = V2)

and the output is:

Warning message:
Values from `V2` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(V1) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L) 

What I'm getting is list-col outputs like:

c("dnaA", "dnaN", "recF",...) 

c("BABS19_RS00005", "BABS19_RS00010", "BABS19_RS00015", ...) 

c("BAbS19_I00010", "BAbS19_I00020", "BAbS19_I00030", ...) 

c("chromosomal replication initiator protein DnaA", "DNA polymerase III subunit beta", "DNA replication/repair protein RecF", ...)  c("ref|WP_002970905.1|", "ref|WP_002974246.1|", "ref|WP_002965250.1|", ...)

c("COORDINATES: similar to AA sequence:RefSeq:WP_009363261.1", "COORDINATES: similar to AA sequence:RefSeq:WP_007872890.1", ...) 

Putting unique rownames into the dataframe doesn't help because then every V2 gets it's own row and the rest of the columns in the row have "NA".

I also looked into unnesting the list-cols [unnest_longer] but the different sizes of the lists causes a problem as well.


Solution

  • You need a field to distinguish which gene each row belongs to, which in this case is implied positionally by the number of times we've encountered "gene" in V1.

    library(tidyverse)
    data |>
      mutate(row = cumsum(V1 == "gene")) |>
      pivot_wider(names_from = V1, values_from = V2)
    

    Result

    # A tibble: 2 × 9
        row gene  locus_tag      old_locustag  product                                        transl_table protein_id inference                          EC_number
      <int> <chr> <chr>          <chr>         <chr>                                          <chr>        <chr>      <chr>                              <chr>    
    1     1 dnaA  BABS19_RS0005  BABS19_I00010 chromosomal replication initiator protein DnaA 11           ref        COORDINATES: similar to AA sequen… NA       
    2     2 dnaN  BABS19_RS00010 BABS19_I00020 DNA polymerase III subunit beta                11           NA         NA                                 2.7.7.7