I need to create a table based on different delimiters and keywords. I have the following file:
>>>ENST00000370225_4
>7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
7E7O_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
Length=2317
Score = 4711 bits (12220), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)
>NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo
sapiens]
P78363.3 RecName: Full=Retinal-specific phospholipid-transporting ATPase
ABCA4; AltName: Full=ATP-binding cassette sub-family A member
4; AltName: Full=RIM ABC transporter; Short=RIM proteinv;
Short=RmP; AltName: Full=Retinal-specific ATP-binding cassette
transporter; AltName: Full=Stargardt disease protein [Homo
sapiens]
7LKP_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
7M1P_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
7M1Q_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4
[Homo sapiens]
EAW73056.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_a [Homo sapiens]
Length=2273
Score = 4711 bits (12219), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)
>>>ENST00000460514_1
>CAH10486.1 hypothetical protein [Homo sapiens]
Length=1065
Score = 301 bits (772), Expect = 2e-96, Method: Compositional matrix adjust.
Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)
>CAA75729.1 ABCR [Homo sapiens]
Length=2273
Score = 300 bits (769), Expect = 2e-94, Method: Compositional matrix adjust.
Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)
And the desired output is:
Transcript Protein Length Score Identity Percent
1 ENST00000370225_4 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2317 4711 bits (12220) 2273/2273 100%
2 ENST00000370225_4 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 2273 4711 bits (12219) 2273/2273 100%
3 ENST00000460514_1 CAH10486.1 hypothetical protein 1065 301 bits (772) 146/146 100%
4 ENST00000460514_1 CAA75729.1 ABCR 2273 300 bits (769) 146/146 100%
Each desired column is delimited in the original file by keywords like length, identities, scores, ">" and ">>>" I tried the following script, but I'm missing to add of the transcript (the first column), which is the first column delimited by ">>>".
my_txt <- readLines(con = "gene_filt_perc.txt")
transcript <-my_txt[grepl("^\\s*>>>", my_txt)]
lengths <-my_txt[grepl("^\\s*Length", my_txt)]
lengths <- gsub("Length=", "", lengths)
scores <-my_txt[grepl("^\\s*Score", my_txt)]
scores <- gsub(" Score = ", "", scores)
scores <- gsub("\\, Expect = ..*", "", scores)
identities <-my_txt[grepl("^\\s*Identities", my_txt)]
identities <- gsub(" Identities = ", "", identities)
identities <- gsub("\\, Positives = ..*", "", identities)
protein <-my_txt[grepl("^\\s*>[[:alnum:]]", my_txt)]
result <-data.frame("protein"=protein, "identities"=identities, "scores"=scores, "lengths"=lengths)
result
protein identities scores lengths
1 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2273/2273 (100%) 4711 bits (12220) 2317
2 7LKZ_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2271/2273 (99%) 4711 bits (12220) 2273
3 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo 2273/2273 (100%) 4711 bits (12219) 2273
4 BAE06122.2 ABCA4 variant protein [Homo sapiens] 2272/2273 (99%) 4710 bits (12218) 2273
5 7E7Q_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2271/2273 (99%) 4709 bits (12214) 2317
Is there an easier way to construct the data.frame?
Doing gsub
on all data will destroy the hierarchical nesting. Keep your data together. Start with one big string stored in data.txt
and split them to get one element per transcript. Fields can be extracted using relative expressions. Lists can be converted into multiple rows in a table using unnest
. This automatically repeats transcript IDs for each corresponding protein.
library(tidyverse)
read_file("data.txt") %>%
str_split(">>>") %>%
simplify() %>%
discard(~ .x == "") %>%
# parse all transcripts
map(function(Transcript) {
list(
Transcript = Transcript %>% str_extract("ENST.*"),
Protein = {
Transcript %>%
str_split(">") %>%
simplify() %>%
discard(~ .x %>% str_detect("^ENST"))
}
)
}) %>%
enframe() %>%
select(value) %>%
unnest_wider(value) %>%
unnest(Protein) %>%
mutate(
# parse all proteins
Protein = Protein %>% map(function(Protein) {
list(
# first line
Protein = Protein %>% str_split("\n") %>% simplify() %>% first() %>% str_trim(),
# numbers after pattern 'Length='
Length = Protein %>% str_extract("(?<=Length=)[0-9 ]+") %>% as.numeric(),
# numbers after pattern 'Score = '
Score = Protein %>% str_extract("(?<=Score = )[0-9 ]+") %>% as.numeric()
)
})
) %>%
unnest_wider(Protein)
#> Transcript Protein Length Score
#> <chr> <chr> <dbl> <dbl>
#>1 ENST00000370225_4 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 2317 4711
#>2 ENST00000370225_4 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo 2273 4711
#>3 ENST00000460514_1 CAH10486.1 hypothetical protein [Homo sapiens] 1065 301
#>4 ENST00000460514_1 CAA75729.1 ABCR [Homo sapiens] 2273 300