Tags: r, pattern-matching, stringr, stringi

How to extract all characters before and after a certain set of characters in R while making sure those characters are first/last in the string?


I have a long string:

my_string = "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"

I want to extract two things from this string:

  1. Everything "before" the first encountered CAAAAG
  2. Everything "after" the last encountered TGGGCATT

Everything before CAAAAG can be found like this:

stringr::word(my_string, 1, sep = "CAAAAG")

But how do I make sure that it splits on the first CAAAAG in the string, and that I am getting all characters found before that very first CAAAAG?

The same goes for TGGGCATT. I can receive everything "after" TGGGCATT in this way:

stringr::word(my_string, -1, sep = "TGGGCATT")

But how do I make sure that I am getting all characters coming "after" the LAST TGGGCATT in my string?


Solution

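  • Here is one approach for each part: str_match_all() with a lazy capture group for the text before the first CAAAAG, and str_locate_all() plus substr() for the text after the last TGGGCATT.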

    my_string = "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"
    
    library(stringr)
    
    str_match_all(my_string, '(.*?)CAAAAG')
    
    #[[1]]
    #     [,1]                                                                           
    #[1,] "GTCAGTCGATCTGGGCATTATGCGTCAAAAG"                                              
    #[2,] "GCTGCTAGCTAAAGCTGATCAGCATCAAAAG"                                              
    #[3,] "GCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAG"
    #[4,] "GTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAG"                 
    #     [,2]                                                                     
    #[1,] "GTCAGTCGATCTGGGCATTATGCGT"                                              
    #[2,] "GCTGCTAGCTAAAGCTGATCAGCAT"                                              
    #[3,] "GCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCA"
    #[4,] "GTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCA"  
    
    first.match <- str_match_all(my_string, '(.*?)CAAAAG')[[1]][1,2]
    first.match
    #[1] "GTCAGTCGATCTGGGCATTATGCGT"
    
    str_locate_all(my_string, 'TGGGCATT')
    #[[1]]
    #     start end
    #[1,]    12  19
    #[2,]   106 113
    #[3,]   175 182
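    # The last row is the last TGGGCATT; its "end" column is the position of that match's final character
    second.match.index <- str_locate_all(my_string, 'TGGGCATT')[[1]]
    last.end <- second.match.index[nrow(second.match.index), "end"]
    # Start one character past the end of the last match and run to the end of the string
    second.match <- substr(my_string, last.end + 1, nchar(my_string))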
    
    second.match
    #[1] "ACAGTCAGTCACAAAAGGTCAGTCAGTCA"
    

    Edit: Added '+1' so that the substring starts at the character right after the last TGGGCATT, not at the final T of the match itself.
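
For comparison, the same two pieces can probably also be pulled out in base R with sub(), relying on how the regex engine treats "first" and "last". This is only a minimal sketch, not part of the approach above. It assumes the motifs contain no regex metacharacters (true for plain DNA letters) and that the string has no line breaks; before_first and after_last are just illustrative names.

```r
my_string <- "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"

# A regex match always starts at the leftmost possible position, so this
# deletes everything from the FIRST CAAAAG through the end of the string.
before_first <- sub("CAAAAG.*$", "", my_string)

# The greedy ".*" grabs as much as it can, so it only gives back enough
# characters to match the LAST TGGGCATT; everything up to and including
# that occurrence is deleted.
after_last <- sub("^.*TGGGCATT", "", my_string)

before_first
#[1] "GTCAGTCGATCTGGGCATTATGCGT"
after_last
#[1] "ACAGTCAGTCACAAAAGGTCAGTCAGTCA"
```

One caveat: if a motif does not occur at all, sub() returns the input unchanged, so it may be worth checking with grepl("CAAAAG", my_string, fixed = TRUE) first.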