Reshape dataframe with unequal occurrences of character variables to pivot_wider without removing duplicate names or nesting character variables

I am trying to take a jumbled dataframe that does not have unique identifiers and spread the columns by the names in V1 and keep the values from the V2 and in the same order. I have been looking into pivot_wider() but I have a couple of inherent issues with that function in its simplest form, but I'm not sure how to resolve it.

The original source of this data did not have it organized in an optimal way to easily group.

V1	V2
gene	dnaA
locus_tag	BABS19_RS0005
old_locustag	BABS19_I00010
product	chromosomal replication initiator protein DnaA
transl_table	11
protein_id	ref
inference	COORDINATES: similar to AA sequence
gene	dnaN
locus_tag	BABS19_RS00010
old_locustag	BABS19_I00020
product	DNA polymerase III subunit beta
EC_number	2.7.7.7
transl_table	11

I'd like to convert it to a frame that looks more like this:

gene	locus_tag	old_locustag	product	transl_table	protein_id	inference	EC_number
dnaA	BABS19_RS0005	BABS19_I00010	chromosomal replication initiator protein DnaA	11	ref	WP_002970905.1		NA
dnaN	BABS19_RS00010	BABS19_I00020	DNA polymerase III subunit beta	11	NA	NA	2.7.7.7

The tricky part is some information is missing that I don't know how to add to the table. Basically some of the groups have data missing (For example: some have an EC_number and others do not), and it would be great to insert an NA instead of not including the variable, which would result in incorrect placement of variables in the rows, but that is not how the data is presented in its current form. I may need to reshape it but also do not know how to do that.

If I call the original frame "data", my code is:

wide <- data %>% pivot_wider(names_from = V1, values_from = V2)

and the output is:

Warning message:
Values from `V2` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(V1) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)

What I'm getting is list-col outputs like:

c("dnaA", "dnaN", "recF",...) 

c("BABS19_RS00005", "BABS19_RS00010", "BABS19_RS00015", ...) 

c("BAbS19_I00010", "BAbS19_I00020", "BAbS19_I00030", ...) 

c("chromosomal replication initiator protein DnaA", "DNA polymerase III subunit beta", "DNA replication/repair protein RecF", ...)  c("ref|WP_002970905.1|", "ref|WP_002974246.1|", "ref|WP_002965250.1|", ...)

c("COORDINATES: similar to AA sequence:RefSeq:WP_009363261.1", "COORDINATES: similar to AA sequence:RefSeq:WP_007872890.1", ...)

Putting unique rownames into the dataframe doesn't help because then every V2 gets it's own row and the rest of the columns in the row have "NA".

I also looked into unnesting the list-cols [unnest_longer] but the different sizes of the lists causes a problem as well.

Solution

You need a field to distinguish which gene each row belongs to, which in this case is implied positionally by the number of times we've encountered "gene" in V1.

library(tidyverse)
data |>
  mutate(row = cumsum(V1 == "gene")) |>
  pivot_wider(names_from = V1, values_from = V2)

Result

# A tibble: 2 × 9
    row gene  locus_tag      old_locustag  product                                        transl_table protein_id inference                          EC_number
  <int> <chr> <chr>          <chr>         <chr>                                          <chr>        <chr>      <chr>                              <chr>    
1     1 dnaA  BABS19_RS0005  BABS19_I00010 chromosomal replication initiator protein DnaA 11           ref        COORDINATES: similar to AA sequen… NA       
2     2 dnaN  BABS19_RS00010 BABS19_I00020 DNA polymerase III subunit beta                11           NA         NA                                 2.7.7.7