I have a long string:
my_string = "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"
I want to extract two things from this string:
Everything before CAAAAG can be found like this:
stringr::word(my_string, 1, sep = "CAAAAG")
But how do I make sure that this splits on the first CAAAAG in the string, so that I get all characters before the very first CAAAAG?
The same goes for TGGGCATT. I can get everything after TGGGCATT this way:
stringr::word(my_string, -1, sep = "TGGGCATT")
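But how do I make sure that I am getting all characters after the last TGGGCATT in my string?
Here are two approaches, one for each case. For the first CAAAAG, a lazy match (.*?) captures the text before each occurrence, and the first row is the one you want. For the last TGGGCATT, str_locate_all gives the position of every match, and the last row tells you where the substring should start.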
my_string = "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"
library(stringr)
str_match_all(my_string, '(.*?)CAAAAG')
#[[1]]
#     [,1]
#[1,] "GTCAGTCGATCTGGGCATTATGCGTCAAAAG"
#[2,] "GCTGCTAGCTAAAGCTGATCAGCATCAAAAG"
#[3,] "GCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAG"
#[4,] "GTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAG"
#     [,2]
#[1,] "GTCAGTCGATCTGGGCATTATGCGT"
#[2,] "GCTGCTAGCTAAAGCTGATCAGCAT"
#[3,] "GCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCA"
#[4,] "GTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCA"
# Row 1 is the first match; column 2 is the capture group (the text before CAAAAG)
first.match <- str_match_all(my_string, '(.*?)CAAAAG')[[1]][1,2]
first.match
#[1] "GTCAGTCGATCTGGGCATTATGCGT"
str_locate_all(my_string, 'TGGGCATT')
#[[1]]
# start end
#[1,] 12 19
#[2,] 106 113
#[3,] 175 182
# Take the last row of match positions and start one character past its end
second.match.index <- str_locate_all(my_string, 'TGGGCATT')[[1]]
second.match <- substr(my_string,
                       second.match.index[nrow(second.match.index), "end"] + 1,
                       nchar(my_string))
second.match
#[1] "ACAGTCAGTCACAAAAGGTCAGTCAGTCA"
Edit: Added '+1' so the substring starts at the character right after the last match, not at the final character of the match itself.
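If you only need the two substrings and not the match positions, a shorter route is possible with base R's sub(). This is a sketch of an alternative, not part of the approach above, and the names before_first and after_last are made up for illustration. It relies on two regex behaviours: the engine finds the leftmost match, so "CAAAAG.*" starts at the first CAAAAG, and ".*" is greedy, so ".*TGGGCATT" runs up to the last TGGGCATT.

```r
my_string <- "GTCAGTCGATCTGGGCATTATGCGTCAAAAGGCTGCTAGCTAAAGCTGATCAGCATCAAAAGGCCGCCCCTATGCTACGAGCATCATGCATCTGGGTCTAGCTAGTGGGCATTCTCTCTGCTGCATTCAGTCACAAAAGGTGTCAGTCGTAGTCATCATCTACATCGTTCATGCTGGGCATTACAGTCAGTCACAAAAGGTCAGTCAGTCA"

# Everything before the FIRST "CAAAAG":
# the match begins at the leftmost CAAAAG and ".*" extends it to the end
# of the string, so removing the match leaves only the leading part.
before_first <- sub("CAAAAG.*", "", my_string)

# Everything after the LAST "TGGGCATT":
# the greedy ".*" consumes as much as it can, so the match ends at the
# final TGGGCATT, and removing it leaves only the trailing part.
after_last <- sub(".*TGGGCATT", "", my_string)

before_first
after_last
```

One caveat: if the marker does not occur at all, sub() finds nothing to replace and returns the whole string unchanged, so check with grepl() first if that case can happen in your data.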