Split column of strings into matrix of strings using regex with colnames based on part of match


I'm creating a tibble from the gene descriptions in a FASTA file of protein-coding sequences. Here's some example data I want to process:

seqDescriptions <- c("lcl|NC_003888.3_cds_NP_624362.1_1 [locus_tag=SCO0001 ][db_xref=GeneID:1095448] [protein=hypothetical protein] [protein_id=NP_624362.1] [location=446..1123] [gbkey=CDS]",
"lcl|NC_003888.3_cds_NP_624363.1_2 [locus_tag=SCO0002] [db_xref=GeneID:1095447] [protein=hypothetical protein] [protein_id=NP_624363.1] [location=1252..3813] [gbkey=CDS]",
"lcl|NC_003888.3_cds_NP_624364.1_3 [locus_tag=SCO0003] [db_xref=GeneID:1095446] [protein=DNA-binding protein] [protein_id=NP_624364.1] [location=3869..6220] [gbkey=CDS]",
"lcl|NC_003888.3_cds_NP_631871.1_4 [locus_tag=SCO0004] [db_xref=GeneID:1095445] [protein=hypothetical protein] [protein_id=NP_631871.1] [location=6226..7173] [gbkey=CDS]")

I want to extract the leading run of non-space characters into one column, and then the value to the right of each tag into a column of its own. The tags are defined manually:

tagList <- c("locus_tag", "db_xref", "protein", "protein_id", "location", "gbkey")

My goal is to have a tibble that looks like this:

# A tibble: 4 x 7
name                                  locus_tag    db_xref       protein ...
<chr>                                 <chr>         <chr>        <chr>  ...
"lcl|NC_003888.3_cds_NP_624362.1_1"  "SCO0001"  "GeneID:1095448" "hypothetical protein" ...
"lcl|NC_003888.3_cds_NP_624363.1_2"  "SCO0002"  "GeneID:1095447" "hypothetical protein" ...

The code below works, but I'd like to

  1. See how to implement it in a tidyr way.
  2. Have the columns named using the value of tag as each one is constructed, rather than after the fact.
  3. Learn about any bioinformatics tools that could do this more directly, for example without having to define the tags manually.


library(tibble)

fastaID <- sub("^(\\S+) .*", "\\1", seqDescriptions)  # leading non-space token
seqTags <- sub("^\\S+ (.*)", "\\1", seqDescriptions)  # everything after it

dBase <- tibble(fasta_ID = fastaID)
for (tag in tagList) {
    ## ']' needs no escape when it directly follows '^' inside a bracket expression
    tagPattern <- paste0(".*\\[", tag, "=([^]]+).*")
    dBase <- tibble::add_column(dBase, sub(tagPattern, "\\1", seqTags), .name_repair = "unique")
}

names(dBase) <- c("fasta_ID", tagList)

Solution

  • Making use of tidyr::extract we could do:

    d <- data.frame(
      seqDescriptions = seqDescriptions
    )
    
    tagList <- c("locus_tag", "db_xref", "protein", "protein_id", "location", "gbkey")
    # One capture group per tag, e.g. "\\[locus_tag=(.*)\\]"
    regex_tag <- paste0("\\[", tagList, "=(.*)\\]")
    # Leading non-space token, then the tags in order, separated by optional whitespace
    regex <- paste(c("^(\\S+)?", regex_tag), collapse = "\\s*")
    
    library(tidyr)
    
    d %>% 
      extract(seqDescriptions, into = c("name", tagList), regex)
    #>                                name locus_tag        db_xref
    #> 1 lcl|NC_003888.3_cds_NP_624362.1_1  SCO0001  GeneID:1095448
    #> 2 lcl|NC_003888.3_cds_NP_624363.1_2   SCO0002 GeneID:1095447
    #> 3 lcl|NC_003888.3_cds_NP_624364.1_3   SCO0003 GeneID:1095446
    #> 4 lcl|NC_003888.3_cds_NP_631871.1_4   SCO0004 GeneID:1095445
    #>                protein  protein_id   location gbkey
    #> 1 hypothetical protein NP_624362.1  446..1123   CDS
    #> 2 hypothetical protein NP_624363.1 1252..3813   CDS
    #> 3  DNA-binding protein NP_624364.1 3869..6220   CDS
    #> 4 hypothetical protein NP_631871.1 6226..7173   CDS
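
On points 2 and 3 of the question, the tag list may not need to be typed out at all. Below is a minimal sketch, not a definitive implementation. It assumes every field has the form [key=value], that values contain no ']', and that no tag repeats within a description; these are assumptions about the FASTA headers that the question does not guarantee. It pulls out every bracketed field with base R's regmatches() and gregexpr(), then lets tidyr::pivot_wider() name the columns from the tags themselves. The names fields, long and wide are illustrative, and only two of the sample descriptions are used to keep it short.

```r
library(tidyr)
library(tibble)

# Two of the sample descriptions from the question, copied verbatim.
seqDescriptions <- c(
  "lcl|NC_003888.3_cds_NP_624362.1_1 [locus_tag=SCO0001 ][db_xref=GeneID:1095448] [protein=hypothetical protein] [protein_id=NP_624362.1] [location=446..1123] [gbkey=CDS]",
  "lcl|NC_003888.3_cds_NP_624363.1_2 [locus_tag=SCO0002] [db_xref=GeneID:1095447] [protein=hypothetical protein] [protein_id=NP_624363.1] [location=1252..3813] [gbkey=CDS]"
)

# Every "[key=value]" field, one character vector per description.
fields <- regmatches(seqDescriptions,
                     gregexpr("\\[[^]=]+=[^]]*\\]", seqDescriptions))

# Long form: one row per field, tagged with the sequence name it came from.
long <- tibble(
  name  = rep(sub("^(\\S+).*", "\\1", seqDescriptions), lengths(fields)),
  field = unlist(fields)
)
long$tag   <- sub("^\\[([^=]+)=.*$", "\\1", long$field)             # text before the first '='
long$value <- trimws(sub("^\\[[^=]+=(.*)\\]$", "\\1", long$field))  # text after it, trimmed

# Wide form: one column per tag, named from the tag values as they are found.
wide <- pivot_wider(long[c("name", "tag", "value")],
                    names_from = tag, values_from = value)
```

The trimws() call is a deliberate difference from the extract() approach above: it drops the stray space in "[locus_tag=SCO0001 ]" from the first description. Because the columns come from whatever tags are present, a description that lacks a tag would get NA in that column rather than failing to match.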