r, regex, regex-group

Match regex with overlap for DNA


I am trying to match DNA sequences that start at the beginning of the string or at a multiple of 3 letters from the beginning, begin with either ATG or CGA, are followed by 6, 9, 12, 15, ... letters, and end in AGT. The following code returns only one of the matches (the longest one). I have looked into positive lookaheads (e.g. (?=...)) but could not figure out how to apply them to this situation.

library(stringr)

dna <- "ABCATGABCGAAADFAGTAAAAGTAGTAAAGT"
str_match_all(dna, "^(...)*((?:ATG|CGA)(?:...){2,}(?:AGT))")

[[1]]
     [,1]                          [,2]  [,3]      
[1,] "ABCATGABCGAAADFAGTAAAAGTAGT" "ABC" "ATGABCGAAADFAGTAAAAGTAGT"

Desired:
ABCATGABCGAAADFAGT ABC ATGABCGAAADFAGT
ABCATGABCGAAADFAGTAAAAGT ABC ATGABCGAAADFAGTAAAAGT
ABCATGABCGAAADFAGTAAAAGTAGT ABC ATGABCGAAADFAGTAAAAGTAGT
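
Why only one match comes back can be checked with a small base-R sketch (the variable names `greedy` and `lazy` are chosen here for illustration). Because the pattern is anchored with ^, it can match at most once per string, and the greedy quantifiers make that single match the longest one. Switching to lazy quantifiers returns the shortest match instead, but still only one, so a single anchored regex does not seem able to enumerate all three desired rows.

```r
dna <- "ABCATGABCGAAADFAGTAAAAGTAGTAAAGT"

# Pattern from the question: greedy quantifiers, anchored at the start.
greedy <- "^(...)*((?:ATG|CGA)(?:...){2,}(?:AGT))"
# Same pattern with lazy quantifiers (*? and {2,}?).
lazy <- "^(...)*?((?:ATG|CGA)(?:...){2,}?(?:AGT))"

# regexec() returns the full match followed by the capture groups.
greedy_match <- regmatches(dna, regexec(greedy, dna, perl = TRUE))[[1]]
lazy_match <- regmatches(dna, regexec(lazy, dna, perl = TRUE))[[1]]

greedy_match[3]  # "ATGABCGAAADFAGTAAAAGTAGT" (longest candidate)
lazy_match[3]    # "ATGABCGAAADFAGT" (shortest candidate)
```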

Solution

  • I know you're looking for a regex, but perhaps it's easier if you program it out:

    • Use the regex .{3} to split the string into consecutive triplets (a trailing partial triplet is dropped).
    • Find the positions of the start and stop triplets.
    • Create all possible start/stop combinations.
    • Keep only the combinations in which the stop comes after the start.
    • Extract the corresponding fragments from the original string.
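    library(stringr)
    library(dplyr)  # also provides the %>% pipe

    dna <- "ABCATGABCGAAADFAGTAAAAGTAGTAAAGT"
    # Split into consecutive triplets; triplet i covers characters 3*(i-1)+1 to 3*i
    triplets <- str_extract_all(dna, ".{3}")[[1]]
    # Pair every start triplet with every stop triplet, then keep stop-after-start pairs
    tidyr::expand_grid(
      start = which(triplets %in% c("ATG", "CGA")),
      stop = which(triplets == "AGT")
    ) %>%
      filter(start < stop) %>%
      mutate(fragment = str_sub(dna, 3 * (start - 1) + 1, 3 * stop))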
    
    # A tibble: 3 x 3
      start  stop fragment                
      <int> <int> <chr>                   
    1     2     6 ATGABCGAAADFAGT         
    2     2     8 ATGABCGAAADFAGTAAAAGT   
    3     2     9 ATGABCGAAADFAGTAAAAGTAGT
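
The same steps can be sketched in base R as a reusable function. Everything named here (`find_fragments`, `starts`, `stop_codon`, `min_between`) is hypothetical and not part of the answer above. Two assumptions go beyond the answer: the function also returns the prefix before each fragment, as in the question's desired output, and `min_between` mirrors the `{2,}` in the question's regex by requiring at least two triplets between the start and the closing AGT. Set it to 0 to reproduce the plain `start < stop` filter.

```r
# Hypothetical helper: enumerate all in-frame fragments using base R only.
find_fragments <- function(dna,
                           starts = c("ATG", "CGA"),
                           stop_codon = "AGT",
                           min_between = 2) {
  n <- nchar(dna) %/% 3                 # number of complete triplets
  from <- 3 * (seq_len(n) - 1) + 1      # first character of each triplet
  triplets <- substr(rep(dna, n), from, from + 2)

  # All start/stop index pairs, as in the expand_grid() step.
  grid <- expand.grid(start = which(triplets %in% starts),
                      stop = which(triplets == stop_codon))
  # Assumption: require min_between triplets strictly between start and stop.
  grid <- grid[grid$stop - grid$start - 1 >= min_between, , drop = FALSE]
  grid <- grid[order(grid$start, grid$stop), , drop = FALSE]

  first_char <- 3 * (grid$start - 1) + 1
  x <- rep(dna, nrow(grid))
  data.frame(
    prefix = substr(x, 1L, first_char - 1),
    fragment = substr(x, first_char, 3 * grid$stop),
    stringsAsFactors = FALSE
  )
}

find_fragments("ABCATGABCGAAADFAGTAAAAGTAGTAAAGT")
```

For the example string this should return three rows, each with prefix "ABC" and the same three fragments as the tibble above. Pasting `prefix` and `fragment` together gives the full-match column from the question's desired output.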