Split uneven string in R - variable substring and delimiters

Before posting my question, I would like to emphasize that I did find similar things here but nothing quite like what I need.

I am dealing with FASTA files, more precisely with the FASTA headers, which look like this: sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2

I need to extract the bolded text. The first bolded text is the protein name. The second bold is the gene name. Please, note that they vary, and I start the analysis with multiple fasta headers inside the same string. Only the first header matters, the rest is crap. Here is an example:

> proteinGroups$Fasta.headers
[1] "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2;tr|A0A0F6QNP7|A0A0F6QNP7_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=2 SV=1;tr|A0A3Q1MHV6|A0A3Q1MHV6_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=1 SV=1;tr|A0A3Q1M2B2|A0A3Q1M2B2_B"
[2] "tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=3 SV=1;sp|Q2TBU0|HPT_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=2 SV=1;tr|A0A0M4MD57|A0A0M4MD57_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=2 SV=1;tr|G3X6K8|G3X6K8_BOVIN H"
[3] "tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protein OS=Bos taurus OX=9913 GN=SLC4A7 PE=3 SV=1"                                                                                                                                                                 
[4] "sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;sp|P81282-3|CSPG2_BOVIN Isoform V2 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;tr|F1MZ83|F1MZ83_BOVIN Versican core protein OS=Bos taurus OX=9913 GN=VCAN P"
[5] "tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris) OS=Bos taurus OX=9913 GN=KRT10 PE=2 SV=1;sp|P06394|K1C10_BOVIN Keratin, type I cytoskeletal 10 OS=Bos taurus OX=9913 GN=KRT10 PE=3 SV=1"

As you may have noticed, some protein names are almost an entire phrase, others are just a single word. The same goes for the genes, which are not always 2 characters, reaching up to 6 characters in this example.

Using the info I found here, I was able to build a Frankenstein of a code, but probably far from ideal:

library(stringr)
library(reshape2)

#split the protein name from the other delimiters
fasta.header <- str_split(proteinGroups$Fasta.headers, "(?=OS=)")

#discard the additional fasta headers
protGene <- sapply(fasta.header, "[", c(1,2))

#invert the orientation and change to DF
protGene <- as.data.frame(t(protGene))

#rename columns
colnames(protGene) <- c("protein.name", "gene")

#discard the extra info and keep protein name only
protGene$protein.name <- colsplit(protGene$protein.name, " ", c("X1","X2"))[2]

#split the crap that came along with the additional headers in the first step
temp1 <- strsplit(protGene$gene, ";")

#assign cleaner values to the table
protGene$gene <- sapply(temp1, "[", 1)

#split the rest of the annotation
temp2 <- strsplit(protGene$gene, "OS=| OX=| GN=| PE=| SV=")

#assign gene name to the table
protGene$gene <- sapply(temp2, "[", 4)

I was able to get the data, but I feel this is far from robust or optimized. Any ideas on what to change?

Thank you in advance!

Solution

I'm not sure, if this is what you are looking for. Suppose, your data is stored in a data.frame named proteinGroups and the headers are in column Fasta.headers.

library(stringr)
library(dplyr)

proteinGroups %>% 
  tibble() %>% 
  mutate(string = str_split(Fasta.headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|"),
         rn = row_number()) %>% 
  unnest_longer(string) %>%
  mutate(
    protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"), 
                          str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                          NA_character_),
    gene = ifelse(str_detect(string, ".*GN=([A-Z0-9]*).*"),
                  str_replace(string, ".*GN=([A-Z0-9]*).*", "\\1"),
                  NA_character_),
    .keep = "unused"
  )

We split the strings at a pattern that looks like ;tr|A0A0F6QNP7| or ;sp|P81282-3| into smaller chunks.

We extract everything between _BOVIN and OS=. That's the protein' s name.
We extract everything after GN= that matches a mix from upper case letters and numbers. That's the gene.

So this returns

# A tibble: 14 x 4
   Fasta.headers                                        rn protein_name                          gene 
   <chr>                                             <int> <chr>                                 <chr>
 1 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~     1 Complement C3                         C3   
 2 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~     1 C3-beta-c                             C3   
 3 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~     1 C3-beta-c                             C3   
 4 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~     1 NA                                    NA   
 5 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~     2 Haptoglobin                           HP   
 6 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~     2 Haptoglobin                           HP   
 7 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~     2 Haptoglobin                           HP   
 8 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~     2 NA                                    NA   
 9 tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange pr~     3 Anion exchange protein                SLC4~
10 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~     4 Isoform V3 of Versican core protein   VCAN 
11 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~     4 Isoform V2 of Versican core protein   VCAN 
12 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~     4 Versican core protein                 VCAN 
13 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic~     5 Keratin 10 (Epidermolytic hyperkerat~ KRT10
14 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic~     5 Keratin, type I cytoskeletal 10       KRT10

Since only the first header matters, the rest is crap, we just take the first row per string

proteinGroups %>% 
  tibble() %>% 
  mutate(string = str_split(Fasta.headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|"),
         rn = row_number()) %>% 
  unnest_longer(string) %>%
  mutate(
    protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"), 
                          str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                          NA_character_),
    gene = ifelse(str_detect(string, ".*GN=([A-Z0-9]*).*"),
                  str_replace(string, ".*GN=([A-Z0-9]*).*", "\\1"),
                  NA_character_),
    .keep = "unused"
  ) %>% 
  group_by(rn) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(-rn)

to get

# A tibble: 5 x 3
  Fasta.headers                                         protein_name                             gene 
  <chr>                                                 <chr>                                    <chr>
1 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9~ Complement C3                            C3   
2 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bos ta~ Haptoglobin                              HP   
3 tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protei~ Anion exchange protein                   SLC4~
4 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core ~ Isoform V3 of Versican core protein      VCAN 
5 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyp~ Keratin 10 (Epidermolytic hyperkeratosi~ KRT10