I'm creating a tibble from the gene descriptions in a FASTA file of protein coding sequences. Here's some example data I want to process:
seqDescriptions <- c("lcl|NC_003888.3_cds_NP_624362.1_1 [locus_tag=SCO0001] [db_xref=GeneID:1095448] [protein=hypothetical protein] [protein_id=NP_624362.1] [location=446..1123] [gbkey=CDS]",
"lcl|NC_003888.3_cds_NP_624363.1_2 [locus_tag=SCO0002] [db_xref=GeneID:1095447] [protein=hypothetical protein] [protein_id=NP_624363.1] [location=1252..3813] [gbkey=CDS]",
"lcl|NC_003888.3_cds_NP_624364.1_3 [locus_tag=SCO0003] [db_xref=GeneID:1095446] [protein=DNA-binding protein] [protein_id=NP_624364.1] [location=3869..6220] [gbkey=CDS]",
"lcl|NC_003888.3_cds_NP_631871.1_4 [locus_tag=SCO0004] [db_xref=GeneID:1095445] [protein=hypothetical protein] [protein_id=NP_631871.1] [location=6226..7173] [gbkey=CDS]")
I want to extract the initial set of non-space characters into one column, and then the information to the right of each tag into its own column. Defining the tags manually:
tagList <- c("locus_tag", "db_xref", "protein", "protein_id", "location", "gbkey")
My goal is to have a tibble that looks like this:
# A tibble: 4 x 7
name locus_tag db_xref protein ...
<chr> <chr> <chr> <chr> ...
"lcl|NC_003888.3_cds_NP_624362.1_1" "SCO0001" "GeneID:1095448" "hypothetical protein" ...
"lcl|NC_003888.3_cds_NP_624363.1_2" "SCO0002" "GeneID:1095447" "hypothetical protein" ...
The code below works, but I'd like to do this in a tidyr way, and to name each column with its tag as it's being constructed rather than after the fact.
library(tibble)

fastaID <- sub("^(\\S+) .*", "\\1", seqDescriptions)  # leading run of non-space characters
seqTags <- sub("^\\S+ (.*)", "\\1", seqDescriptions)   # everything after the first space
dBase <- tibble(fasta_ID = fastaID)
for (tag in tagList) {
  # ']' needs no escape when it comes straight after '^' in a character class
  tagPattern <- paste0(".*\\[", tag, "=([^]]+).*")
  dBase <- tibble::add_column(dBase, sub(tagPattern, "\\1", seqTags), .name_repair = "unique")
}
names(dBase) <- c("fasta_ID", tagList)
tibble(tagsUsed))
Making use of tidyr::extract we could do:
library(tidyr)

d <- data.frame(
  seqDescriptions = seqDescriptions
)
tagList <- c("locus_tag", "db_xref", "protein", "protein_id", "location", "gbkey")
# One capture group per tag, e.g. "\\[locus_tag=(.*)\\]"
regex_tag <- lapply(tagList, function(.x) paste0("\\[", .x, "=(.*)\\]"))
regex_tag <- unlist(regex_tag)
# Leading name, then the tag groups in order, separated by optional whitespace
regex <- paste(c("^(\\S+)?", regex_tag), collapse = "\\s*")

d %>%
  extract(seqDescriptions, into = c("name", tagList), regex)
#> name locus_tag db_xref
#> 1 lcl|NC_003888.3_cds_NP_624362.1_1 SCO0001 GeneID:1095448
#> 2 lcl|NC_003888.3_cds_NP_624363.1_2 SCO0002 GeneID:1095447
#> 3 lcl|NC_003888.3_cds_NP_624364.1_3 SCO0003 GeneID:1095446
#> 4 lcl|NC_003888.3_cds_NP_631871.1_4 SCO0004 GeneID:1095445
#> protein protein_id location gbkey
#> 1 hypothetical protein NP_624362.1 446..1123 CDS
#> 2 hypothetical protein NP_624363.1 1252..3813 CDS
#> 3 DNA-binding protein NP_624364.1 3869..6220 CDS
#> 4 hypothetical protein NP_631871.1 6226..7173 CDS
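One thing to keep in mind is that the single combined regex relies on all six tags being present in that fixed order. If that might not hold for every record, a per-tag lookup is an alternative. The following is only a rough base R sketch, not the tidyr approach above: extract_tag() is a hypothetical helper made up for illustration, and it assumes tag values never contain a ']'. It also names each column from its tag while the result is being built.

```r
# Two of the description lines from the question, used as example input
seqDescriptions <- c(
  "lcl|NC_003888.3_cds_NP_624362.1_1 [locus_tag=SCO0001] [db_xref=GeneID:1095448] [protein=hypothetical protein] [protein_id=NP_624362.1] [location=446..1123] [gbkey=CDS]",
  "lcl|NC_003888.3_cds_NP_624364.1_3 [locus_tag=SCO0003] [db_xref=GeneID:1095446] [protein=DNA-binding protein] [protein_id=NP_624364.1] [location=3869..6220] [gbkey=CDS]"
)
tagList <- c("locus_tag", "db_xref", "protein", "protein_id", "location", "gbkey")

# Hypothetical helper: value of one tag for each string, NA where the tag is absent.
# Assumes values never contain ']'.
extract_tag <- function(x, tag) {
  pattern <- paste0("\\[", tag, "=([^]]*)\\]")
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, captured value), or character(0) when nothing matched
  vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_, character(1))
}

# Named list with one element per tag, so the names become the column names
tagValues <- lapply(setNames(tagList, tagList),
                    function(tag) extract_tag(seqDescriptions, tag))

result <- data.frame(
  name = sub(" .*$", "", seqDescriptions),  # everything before the first space
  tagValues,
  stringsAsFactors = FALSE
)
```

If a tibble is needed downstream, tibble::as_tibble(result) should convert the data frame.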