Search code examples
rstringoptimizationfasta

Split uneven string in R - variable substring and delimiters


Before posting my question, I would like to emphasize that I did find similar things here but nothing quite like what I need.

I am dealing with FASTA files, more precisely with the FASTA headers, which look like this: sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2

I need to extract the bolded text. The first bolded text is the protein name. The second bold is the gene name. Please, note that they vary, and I start the analysis with multiple fasta headers inside the same string. Only the first header matters, the rest is crap. Here is an example:

> proteinGroups$Fasta.headers
[1] "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2;tr|A0A0F6QNP7|A0A0F6QNP7_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=2 SV=1;tr|A0A3Q1MHV6|A0A3Q1MHV6_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=1 SV=1;tr|A0A3Q1M2B2|A0A3Q1M2B2_B"
[2] "tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=3 SV=1;sp|Q2TBU0|HPT_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=2 SV=1;tr|A0A0M4MD57|A0A0M4MD57_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=2 SV=1;tr|G3X6K8|G3X6K8_BOVIN H"
[3] "tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protein OS=Bos taurus OX=9913 GN=SLC4A7 PE=3 SV=1"                                                                                                                                                                 
[4] "sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;sp|P81282-3|CSPG2_BOVIN Isoform V2 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;tr|F1MZ83|F1MZ83_BOVIN Versican core protein OS=Bos taurus OX=9913 GN=VCAN P"
[5] "tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris) OS=Bos taurus OX=9913 GN=KRT10 PE=2 SV=1;sp|P06394|K1C10_BOVIN Keratin, type I cytoskeletal 10 OS=Bos taurus OX=9913 GN=KRT10 PE=3 SV=1"

As you may have noticed, some protein names are almost an entire phrase, others are just a single word. The same goes for the genes, which are not always 2 characters, reaching up to 6 characters in this example.

Using the info I found here, I was able to build a Frankenstein of a code, but probably far from ideal:

library(stringr)
library(reshape2)

#split the protein name from the other delimiters
fasta.header <- str_split(proteinGroups$Fasta.headers, "(?=OS=)")

#discard the additional fasta headers
protGene <- sapply(fasta.header, "[", c(1,2))

#invert the orientation and change to DF
protGene <- as.data.frame(t(protGene))

#rename columns
colnames(protGene) <- c("protein.name", "gene")

#discard the extra info and keep protein name only
protGene$protein.name <- colsplit(protGene$protein.name, " ", c("X1","X2"))[2]

#split the crap that came along with the additional headers in the first step
temp1 <- strsplit(protGene$gene, ";")

#assign cleaner values to the table
protGene$gene <- sapply(temp1, "[", 1)

#split the rest of the annotation
temp2 <- strsplit(protGene$gene, "OS=| OX=| GN=| PE=| SV=")

#assign gene name to the table
protGene$gene <- sapply(temp2, "[", 4)

I was able to get the data, but I feel this is far from robust or optimized. Any ideas on what to change?

Thank you in advance!


Solution

  • I'm not sure, if this is what you are looking for. Suppose, your data is stored in a data.frame named proteinGroups and the headers are in column Fasta.headers.

    library(stringr)
    library(dplyr)
    
    proteinGroups %>% 
      tibble() %>% 
      mutate(string = str_split(Fasta.headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|"),
             rn = row_number()) %>% 
      unnest_longer(string) %>%
      mutate(
        protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"), 
                              str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                              NA_character_),
        gene = ifelse(str_detect(string, ".*GN=([A-Z0-9]*).*"),
                      str_replace(string, ".*GN=([A-Z0-9]*).*", "\\1"),
                      NA_character_),
        .keep = "unused"
      )
    

    We split the strings at a pattern that looks like ;tr|A0A0F6QNP7| or ;sp|P81282-3| into smaller chunks.

    • We extract everything between _BOVIN and OS=. That's the protein' s name.
    • We extract everything after GN= that matches a mix from upper case letters and numbers. That's the gene.

    So this returns

    # A tibble: 14 x 4
       Fasta.headers                                        rn protein_name                          gene 
       <chr>                                             <int> <chr>                                 <chr>
     1 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~     1 Complement C3                         C3   
     2 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~     1 C3-beta-c                             C3   
     3 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~     1 C3-beta-c                             C3   
     4 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~     1 NA                                    NA   
     5 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~     2 Haptoglobin                           HP   
     6 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~     2 Haptoglobin                           HP   
     7 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~     2 Haptoglobin                           HP   
     8 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~     2 NA                                    NA   
     9 tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange pr~     3 Anion exchange protein                SLC4~
    10 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~     4 Isoform V3 of Versican core protein   VCAN 
    11 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~     4 Isoform V2 of Versican core protein   VCAN 
    12 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~     4 Versican core protein                 VCAN 
    13 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic~     5 Keratin 10 (Epidermolytic hyperkerat~ KRT10
    14 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic~     5 Keratin, type I cytoskeletal 10       KRT10
    

    Since only the first header matters, the rest is crap, we just take the first row per string

    proteinGroups %>% 
      tibble() %>% 
      mutate(string = str_split(Fasta.headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|"),
             rn = row_number()) %>% 
      unnest_longer(string) %>%
      mutate(
        protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"), 
                              str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                              NA_character_),
        gene = ifelse(str_detect(string, ".*GN=([A-Z0-9]*).*"),
                      str_replace(string, ".*GN=([A-Z0-9]*).*", "\\1"),
                      NA_character_),
        .keep = "unused"
      ) %>% 
      group_by(rn) %>% 
      slice(1) %>% 
      ungroup() %>% 
      select(-rn)
    

    to get

    # A tibble: 5 x 3
      Fasta.headers                                         protein_name                             gene 
      <chr>                                                 <chr>                                    <chr>
    1 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9~ Complement C3                            C3   
    2 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bos ta~ Haptoglobin                              HP   
    3 tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protei~ Anion exchange protein                   SLC4~
    4 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core ~ Isoform V3 of Versican core protein      VCAN 
    5 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyp~ Keratin 10 (Epidermolytic hyperkeratosi~ KRT10