regex perl genetic-algorithm genetics

Regex for matching a specific pattern only if it doesn't match another pattern


I need to write a regex that finds genetic sequences, and I am stuck on one specific problem. After the start codon, ATG, come further codons of three nucleotides each, and the match should end at one of three possible stop codons: TAA, TAG or TGA. What if the stop codon comes directly after the start codon (ATG)? My current regex works when there are intermediate codons between the start and stop codons, but if there are none, the regex matches ALL of the sequence after the start codon. I know why it does that, but I have no idea how to change it to work the way I want.

My regex should look for AGGAGG (exactly this pattern), then A, C, G or T (4 to 12 times), then ATG (exactly this pattern), then A, C, G or T in triples (for example ACG, TGC, etc.; any number of them) UNTIL it reaches TAA, TAG or TGA. The match should end there, and the search should resume after it.

Example of a good match:

XXXXXXXXXXXXXXXXXXXXXXXXX   XXXXXXXXXXXXXXXX
AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT

There are two matches in the sequence: from 0 to 25 and from 28 to 44 (0-based offsets, end exclusive).

My current regex (ignore the first two capture groups):

$seq =~ /(AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3,3}){0,}(TAA|TAG|TGA)/ig

Solution

  • The problem comes from quantifiers being greedy by default.

    In (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*(TAA|TAG|TGA), the 4th group ([ACTG]{3})* first matches as many codons as possible; only then is the 5th group tried, with the engine backtracking one codon at a time until a stop codon fits.
    Your sequence contains TAGTAG. With the greedy quantifier, the first TAG is consumed by group 4 and the second one is captured as the ending group.

    You can use a lazy quantifier instead: (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*?(TAA|TAG|TGA) (note the added question mark, which makes the quantifier lazy).
    That way, group 4 takes as few codons as possible, and the first TAG encountered is treated as the ending group.

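To see the difference concretely, here is a minimal Perl sketch that runs both variants against the example sequence from the question and prints the offsets of each match. The helper name match_spans and the variables $greedy and $lazy are illustrative, not part of the original post. The sketch also uses non-capturing groups, on the assumption that only the match positions are needed.

```perl
use strict;
use warnings;

# Example sequence from the question.
my $seq = 'AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT';

# Hypothetical helper: returns "start-end" strings (0-based, end exclusive)
# for every successive match of $re in $dna.
sub match_spans {
    my ($re, $dna) = @_;
    my @spans;
    while ($dna =~ /$re/g) {
        # $-[0] is the match start, $+[0] the offset just past the match.
        push @spans, join('-', $-[0], $+[0]);
    }
    return @spans;
}

# Same pattern twice; the only difference is * versus *? on the codon group.
my $greedy = qr/AGGAGG[ACGT]{4,12}ATG(?:[ACGT]{3})*(?:TAA|TAG|TGA)/i;
my $lazy   = qr/AGGAGG[ACGT]{4,12}ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)/i;

print "greedy: @{[ match_spans($greedy, $seq) ]}\n";  # greedy: 0-28 28-47
print "lazy:   @{[ match_spans($lazy,   $seq) ]}\n";  # lazy:   0-25 28-44
```

The lazy variant reproduces the two matches listed in the question, while the greedy one runs on to the last in-frame stop codon it can reach.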