I need to write a regex to find genetic sequences, and I'm stuck on one specific problem. After the start codon ATG come further codons of three nucleotides each, and the regex ends with one of three possible stop codons: TAA, TAG or TGA. What if the stop (end) codon comes immediately after the start codon ATG? My current regex works when there are intermediate codons between the start and stop codons, but if there are none, the regex matches ALL of the sequence after the start codon. I know why it does that, but I have no idea how to change it to work the way I want.
My regex should look for AGGAGG (exactly this pattern), then A, C, G or T (4 to 12 times), then ATG (exactly this pattern), then A, C, G or T in triples (for example ACG, TGC, etc.; it doesn't matter how many) UNTIL it reaches TAA, TAG or TGA. The match should end there, and the search should start again after it.
Example of a good match:
XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX
AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT
There are two matches in the sequence - from 0 to 25 and from 28 to 44.
My current regex (don't mind the first two brackets):
$seq =~ /(AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3,3}){0,}(TAA|TAG|TGA)/ig
The problem here comes from quantifiers being greedy by default. With (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*(TAA|TAG|TGA), the 4th group ([ACTG]{3})* matches as many triplets as possible, and only then is the 5th group tried (backtracking if needed). Your sequence contains TAGTAG. The greedy quantifier causes the first TAG to be captured in group 4 and the second one to be matched as the ending group.
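To see the greedy behaviour on the example sequence from the question, here is a minimal sketch. It is written in Python rather than Perl, on the assumption that the pattern behaves the same way in both engines for this regex; the variable names are my own.

```python
import re

# Example sequence from the question.
seq = "AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT"

# Original pattern: the codon group uses the greedy quantifier *.
greedy = re.compile(
    r"(AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*(TAA|TAG|TGA)",
    re.IGNORECASE,
)

# Collect (start, end) for every non-overlapping match.
greedy_spans = [m.span() for m in greedy.finditer(seq)]
print(greedy_spans)  # [(0, 28), (28, 47)]

# In the first match, group 4 holds the last triplet it consumed,
# which is the first TAG of TAGTAG; group 5 is the second TAG.
first = greedy.search(seq)
print(first.group(4), first.group(5))  # TAG TAG
```

Both matches run three nucleotides too far (ending at 28 and 47 instead of 25 and 44), because the first stop codon is swallowed by the codon group.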
You can use a lazy quantifier instead: (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*?(TAA|TAG|TGA) (note the added question mark, which makes the quantifier lazy). That way, the first TAG encountered is treated as the ending group.
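As a quick check of the lazy version against the spans expected in the question, here is a small sketch. It is in Python rather than Perl, assuming the lazy quantifier *? behaves the same way in both for this pattern; the variable names are my own.

```python
import re

# Example sequence from the question.
seq = "AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT"

# Same pattern, but the codon group now uses the lazy quantifier *?.
lazy = re.compile(
    r"(AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*?(TAA|TAG|TGA)",
    re.IGNORECASE,
)

matches = list(lazy.finditer(seq))
lazy_spans = [m.span() for m in matches]
lazy_texts = [m.group(0) for m in matches]

print(lazy_spans)  # [(0, 25), (28, 44)]
print(lazy_texts)  # ['AGGAGGTATGATGCGTACGGGCTAG', 'AGGAGGTATGATGTAG']
```

The second match shows the case from the question where the stop codon directly follows ATG: the lazy group matches zero triplets and the match ends at the first TAG.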