I am trying to match at least four G-repeats, each repeat separated by a maximum of 7 characters. Example:
AAGGGAAGGGAAAGGGAAGGGAA
I use the following regex, which should match both uppercase and lowercase characters:
$sequence =~ /((G{3,}[ATGC]{1,7}){3,}G{3,})/gi
This should match at least four G-repeats. The problem is that I get a positive hit when I match it against the following sequence:
aaagaggaaaaggggaaaaggggaaaaggggaaa
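The first candidate repeat in this sequence has three g's, but they are interrupted by an a (gagg), so it is not a run of three consecutive g's. That leaves only three real G-repeats, so this sequence should not be matched.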
Solution 1: The problem seemed to be the /i modifier. I could work around it by dropping /i and spelling out both cases in the character classes:
$sequence =~ /(([gG]{3,}[aAtTgGcC]{1,7}){3,}[gG]{3,})/g
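As a quick sanity check, here is a minimal sketch that runs the explicit-case pattern against both sequences from the question. The helper name first_g4 is made up for illustration. The comments show the result expected from the intended semantics: four G-repeats in the first sequence, only three in the second.

```perl
use strict;
use warnings;

# Pattern from Solution 1: explicit upper/lowercase classes, no /i modifier.
my $g4 = qr/(([gG]{3,}[aAtTgGcC]{1,7}){3,}[gG]{3,})/;

# Hypothetical helper: returns the first match, or undef if there is none.
sub first_g4 {
    my ($sequence) = @_;
    return $sequence =~ $g4 ? $1 : undef;
}

for my $sequence (qw(AAGGGAAGGGAAAGGGAAGGGAA aaagaggaaaaggggaaaaggggaaaaggggaaa)) {
    my $hit = first_g4($sequence);
    print "$sequence: ", (defined $hit ? $hit : "no match"), "\n";
}
# AAGGGAAGGGAAAGGGAAGGGAA: GGGAAGGGAAAGGGAAGGG
# aaagaggaaaaggggaaaaggggaaaaggggaaa: no match
```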
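Solution 2, provided by ikegami: write the G as a bracketed character class. Note that [?!G] is not a negative lookahead (that would be written (?!G)); it is a character class that matches ?, ! or G. Those extra characters should not occur in a DNA sequence, so the pattern matches the same sequences.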
$sequence =~ /(([?!G]{3,}[ATGC]{1,7}){3,}[G]{3,})/gi
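To make the difference between the two notations concrete, here is a small sketch. The test strings are made up for illustration: [?!G] consumes one character out of ?, ! and G, whereas (?!G) is a zero-width assertion that the next character is not G.

```perl
use strict;
use warnings;

# [?!G] is a character class: one of '?', '!' or 'G' (or 'g' under /i).
my $class = qr/^[?!G]{3}$/i;

# (?!G) is the real negative-lookahead syntax: zero-width, "next char is not G".
my $lookahead = qr/^(?!G)/i;

my @class_hits = grep { $_ =~ $class } qw(GGG g?! ??? GAG);
print "class matches: @class_hits\n";        # class matches: GGG g?! ???

my @not_g_first = grep { $_ =~ $lookahead } qw(GGG AGG);
print "lookahead matches: @not_g_first\n";   # lookahead matches: AGG
```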
Thanks @ikegami for the hint and for submitting the bug report.
$ perl -E'say $& while "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~ /((G{3,}[ATGC]{1,7}){3,}G{3,})/gi'
gggaaaaggggaaaagggg
You've found a bug! I filed a bug report.
This bug has been around since at least 5.10, and it's present in the latest release (5.24.0).
Update: Fixed in Perl 5.26, released on 2017-05-30.
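If you are unsure whether a particular perl is affected, a minimal sketch like the following reruns the original /i pattern on the sequence from the question and reports the result. The helper name bogus_match is made up for illustration. A correct engine should find nothing, because the sequence has only three runs of three or more g's.

```perl
use strict;
use warnings;

# Sequence from the question: only three runs of >= 3 g's, so a correct
# regex engine must not find four G-repeats in it.
my $sequence = "aaagaggaaaaggggaaaaggggaaaaggggaaa";

# Hypothetical helper: returns the bogus match if this perl has the bug,
# or undef if the pattern (correctly) fails to match.
sub bogus_match {
    return $sequence =~ /((G{3,}[ATGC]{1,7}){3,}G{3,})/i ? $1 : undef;
}

my $bogus = bogus_match();
printf "perl %vd: %s\n", $^V,
    defined $bogus ? "affected, matched '$bogus'" : "not affected";
```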