Before posting my question, I would like to emphasize that I did find similar things here but nothing quite like what I need.
I am dealing with FASTA files, more precisely with the FASTA headers, which look like this: sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2
I need to extract two pieces from each header: the protein name (here, Complement C3) and the gene name (here, C3, the value after GN=). Please note that both vary, and I start the analysis with multiple FASTA headers inside the same string. Only the first header matters; the rest can be discarded. Here is an example:
> proteinGroups$Fasta.headers
[1] "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2;tr|A0A0F6QNP7|A0A0F6QNP7_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=2 SV=1;tr|A0A3Q1MHV6|A0A3Q1MHV6_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=1 SV=1;tr|A0A3Q1M2B2|A0A3Q1M2B2_B"
[2] "tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=3 SV=1;sp|Q2TBU0|HPT_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=2 SV=1;tr|A0A0M4MD57|A0A0M4MD57_BOVIN Haptoglobin OS=Bos taurus OX=9913 GN=HP PE=2 SV=1;tr|G3X6K8|G3X6K8_BOVIN H"
[3] "tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protein OS=Bos taurus OX=9913 GN=SLC4A7 PE=3 SV=1"
[4] "sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;sp|P81282-3|CSPG2_BOVIN Isoform V2 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;tr|F1MZ83|F1MZ83_BOVIN Versican core protein OS=Bos taurus OX=9913 GN=VCAN P"
[5] "tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris) OS=Bos taurus OX=9913 GN=KRT10 PE=2 SV=1;sp|P06394|K1C10_BOVIN Keratin, type I cytoskeletal 10 OS=Bos taurus OX=9913 GN=KRT10 PE=3 SV=1"
As you may have noticed, some protein names are almost an entire phrase, others are just a single word. The same goes for the genes, which are not always 2 characters, reaching up to 6 characters in this example.
Using the info I found here, I was able to stitch together a Frankenstein of a script, but it is probably far from ideal:
library(stringr)
library(reshape2)
#split the protein name from the other delimiters
fasta.header <- str_split(proteinGroups$Fasta.headers, "(?=OS=)")
#discard the additional fasta headers
protGene <- sapply(fasta.header, "[", c(1,2))
#invert the orientation and change to DF
protGene <- as.data.frame(t(protGene))
#rename columns
colnames(protGene) <- c("protein.name", "gene")
#discard the extra info and keep protein name only
protGene$protein.name <- colsplit(protGene$protein.name, " ", c("X1","X2"))[[2]]
#split the crap that came along with the additional headers in the first step
temp1 <- strsplit(protGene$gene, ";")
#assign cleaner values to the table
protGene$gene <- sapply(temp1, "[", 1)
#split the rest of the annotation
temp2 <- strsplit(protGene$gene, "OS=| OX=| GN=| PE=| SV=")
#assign gene name to the table
protGene$gene <- sapply(temp2, "[", 4)
I was able to get the data, but I feel this is far from robust or optimized. Any ideas on what to change?
Thank you in advance!
I'm not sure if this is what you are looking for. Suppose your data is stored in a data.frame named proteinGroups and the headers are in the column Fasta.headers.
library(stringr)
library(dplyr)
library(tidyr)  # provides unnest_longer()

proteinGroups %>%
  tibble() %>%
  # split each string into one chunk per FASTA header
  mutate(string = str_split(Fasta.headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|"),
         rn = row_number()) %>%
  unnest_longer(string) %>%
  mutate(
    # protein name: text between "_BOVIN " and " OS="; NA if the chunk is truncated
    protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"),
                          str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                          NA_character_),
    # gene: upper-case letters and digits following "GN="
    gene = ifelse(str_detect(string, ".*GN=([A-Z0-9]*).*"),
                  str_replace(string, ".*GN=([A-Z0-9]*).*", "\\1"),
                  NA_character_),
    .keep = "unused"
  )
We split the strings at a pattern that looks like ;tr|A0A0F6QNP7| or ;sp|P81282-3| into smaller chunks.
In each chunk, we capture the text between _BOVIN and OS=. That's the protein's name.
After GN=, we capture the run of upper-case letters and digits. That's the gene.
So this returns
# A tibble: 14 x 4
Fasta.headers rn protein_name gene
<chr> <int> <chr> <chr>
1 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~ 1 Complement C3 C3
2 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~ 1 C3-beta-c C3
3 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~ 1 C3-beta-c C3
4 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus ~ 1 NA NA
5 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~ 2 Haptoglobin HP
6 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~ 2 Haptoglobin HP
7 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~ 2 Haptoglobin HP
8 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bo~ 2 NA NA
9 tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange pr~ 3 Anion exchange protein SLC4~
10 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~ 4 Isoform V3 of Versican core protein VCAN
11 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~ 4 Isoform V2 of Versican core protein VCAN
12 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican c~ 4 Versican core protein VCAN
13 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic~ 5 Keratin 10 (Epidermolytic hyperkerat~ KRT10
14 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic~ 5 Keratin, type I cytoskeletal 10 KRT10
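To see the individual steps in isolation, here is a minimal sketch that applies the same three patterns to a single string. The variable names x, chunks, protein_name and gene are only illustrative, and the input is the fourth sample string from the question, shortened to its first two headers.

```r
library(stringr)

# One string holding two headers (sample [4] from the question, shortened)
x <- paste0(
  "sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN;",
  "sp|P81282-3|CSPG2_BOVIN Isoform V2 of Versican core protein OS=Bos taurus OX=9913 GN=VCAN"
)

# Split wherever a new header begins: ";", two lower-case letters, "|accession|"
chunks <- str_split(x, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|")[[1]]

# Protein name: lazy capture between "_BOVIN " and " OS="
protein_name <- str_replace(chunks, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1")

# Gene: run of upper-case letters and digits right after "GN="
gene <- str_replace(chunks, ".*GN=([A-Z0-9]*).*", "\\1")

print(chunks)
print(protein_name)
print(gene)
```

Note that the split consumes the database tag and accession of every header except the first, which is why the patterns anchor on _BOVIN rather than on the leading sp| or tr|.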
Since only the first header matters, we just take the first row per string:
proteinGroups %>%
  tibble() %>%
  mutate(string = str_split(Fasta.headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|"),
         rn = row_number()) %>%
  unnest_longer(string) %>%
  mutate(
    protein_name = ifelse(str_detect(string, ".*_BOVIN\\s(.*?)\\sOS=.*"),
                          str_replace(string, ".*_BOVIN\\s(.*?)\\sOS=.*", "\\1"),
                          NA_character_),
    gene = ifelse(str_detect(string, ".*GN=([A-Z0-9]*).*"),
                  str_replace(string, ".*GN=([A-Z0-9]*).*", "\\1"),
                  NA_character_),
    .keep = "unused"
  ) %>%
  group_by(rn) %>%
  slice(1) %>%
  ungroup() %>%
  select(-rn)
to get
# A tibble: 5 x 3
Fasta.headers protein_name gene
<chr> <chr> <chr>
1 sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9~ Complement C3 C3
2 tr|A0A3Q1MB98|A0A3Q1MB98_BOVIN Haptoglobin OS=Bos ta~ Haptoglobin HP
3 tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protei~ Anion exchange protein SLC4~
4 sp|P81282-4|CSPG2_BOVIN Isoform V3 of Versican core ~ Isoform V3 of Versican core protein VCAN
5 tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyp~ Keratin 10 (Epidermolytic hyperkeratosi~ KRT10
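If only the first header is ever needed, the split, unnest and slice steps could probably be skipped. The following is a sketch of that variation, not part of the approach above. It assumes every string starts with a header of the form db|accession|ENTRY_NAME followed by a space, and it uses ^\\S+\\s instead of _BOVIN so that it does not depend on the species suffix. The names headers, first_header and result are hypothetical.

```r
library(stringr)

# Three of the sample strings from the question (the first one shortened)
headers <- c(
  "sp|Q2UVX4|CO3_BOVIN Complement C3 OS=Bos taurus OX=9913 GN=C3 PE=1 SV=2;tr|A0A0F6QNP7|A0A0F6QNP7_BOVIN C3-beta-c OS=Bos taurus OX=9913 GN=C3 PE=2 SV=1",
  "tr|A0A3Q1LH05|A0A3Q1LH05_BOVIN Anion exchange protein OS=Bos taurus OX=9913 GN=SLC4A7 PE=3 SV=1",
  "tr|A6QNZ7|A6QNZ7_BOVIN Keratin 10 (Epidermolytic hyperkeratosis; keratosis palmaris et plantaris) OS=Bos taurus OX=9913 GN=KRT10 PE=2 SV=1;sp|P06394|K1C10_BOVIN Keratin, type I cytoskeletal 10 OS=Bos taurus OX=9913 GN=KRT10 PE=3 SV=1"
)

# Drop everything from the start of the second header onwards.
# A plain ";" inside a protein name is not followed by "xx|...|", so it survives.
first_header <- str_remove(headers, ";[a-z]{2}\\|[A-Z0-9\\-]*\\|.*$")

result <- data.frame(
  # text between the first whitespace and " OS="
  protein_name = str_match(first_header, "^\\S+\\s(.*?)\\sOS=")[, 2],
  # non-whitespace token after "GN="
  gene = str_match(first_header, "GN=(\\S+)")[, 2],
  stringsAsFactors = FALSE
)

print(result)
```

str_match() returns NA when a pattern does not match, so no ifelse() guard is needed here. Capturing \\S+ after GN= would also keep gene symbols that contain lower-case letters or hyphens, should those occur in other data sets.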