I am trying to match DNA sequences that start at the beginning of the string or at a multiple of 3 letters from the beginning, begin with either ATG or CGA, are followed by 6, 9, 12, 15, ... letters, and end in AGT. The following code returns only one of the matches (the longest one). I have looked into "positive lookaheads" (e.g. ?=) but could not figure out how to apply them to this situation.
library(stringr)
dna <- c("ABCATGABCGAAADFAGTAAAAGTAGTAAAGT")
str_match_all(dna, "^(...)*((?:ATG|CGA)(?:...){2,}(?:AGT))")
[[1]]
     [,1]                          [,2]  [,3]
[1,] "ABCATGABCGAAADFAGTAAAAGTAGT" "ABC" "ATGABCGAAADFAGTAAAAGTAGT"
Desired:
ABCATGABCGAAADFAGT ABC ATGABCGAAADFAGT
ABCATGABCGAAADFAGTAAAAGT ABC ATGABCGAAADFAGTAAAAGT
ABCATGABCGAAADFAGTAAAAGTAGT ABC ATGABCGAAADFAGTAAAAGTAGT
I know you're looking for a regex, but perhaps it's easier if you program it out:
Use .{3} to split the string into triplets.
library(stringr)
library(dplyr)  # provides %>%
dna <- c("ABCATGABCGAAADFAGTAAAAGTAGTAAAGT")
triplets <- str_extract_all(dna, ".{3}")[[1]]
tidyr::expand_grid(
start = which(triplets %in% c("ATG", "CGA")),
stop = which(triplets == "AGT")
) %>%
dplyr::filter(stop - start >= 3) %>%  # at least two triplets (6+ letters) between start and stop
dplyr::mutate(fragment = stringr::str_sub(dna, 3*(start-1) + 1, 3*stop))
# A tibble: 3 x 3
  start  stop fragment
  <int> <int> <chr>
1     2     6 ATGABCGAAADFAGT
2     2     8 ATGABCGAAADFAGTAAAAGT
3     2     9 ATGABCGAAADFAGTAAAAGTAGT
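If you would rather avoid the tidyverse dependencies, the same idea can be sketched in base R. This is only a rough sketch, not a drop-in replacement: the function name find_fragments and its arguments (starts, stops, min_between) are made up for illustration. I am also assuming that "followed by 6, 9, 12, ... letters" means at least two full triplets between the start and stop triplets, as the {2,} in your regex suggests.

```r
# Hypothetical helper (names are illustrative, not from any package):
# list every in-frame fragment that begins with a start triplet and
# ends with a stop triplet.
find_fragments <- function(dna,
                           starts = c("ATG", "CGA"),
                           stops = "AGT",
                           min_between = 2) {
  # Split into complete triplets; a trailing partial triplet is ignored.
  idx <- seq_len(nchar(dna) %/% 3)
  triplets <- substring(dna, 3 * idx - 2, 3 * idx)

  # Every combination of a start triplet index and a stop triplet index.
  pairs <- expand.grid(
    start = which(triplets %in% starts),
    stop = which(triplets %in% stops)
  )

  # Assumption: require at least `min_between` triplets strictly between them.
  pairs <- pairs[pairs$stop - pairs$start > min_between, , drop = FALSE]
  pairs <- pairs[order(pairs$start, pairs$stop), , drop = FALSE]
  rownames(pairs) <- NULL

  # Text before the fragment (your group 1) and the fragment itself (group 2).
  pairs$prefix <- substring(dna, 1, 3 * (pairs$start - 1))
  pairs$fragment <- substring(dna, 3 * pairs$start - 2, 3 * pairs$stop)
  pairs
}

dna <- "ABCATGABCGAAADFAGTAAAAGTAGTAAAGT"
result <- find_fragments(dna)
print(result)
```

For the example string this should give the same three fragments as the tibble above, with prefix holding the "ABC" that precedes each one. Pasting prefix and fragment together reproduces the first column of your desired output.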