
Perl regular expression mismatches repetitive strings


I am trying to match sequences containing at least four G-repeats (runs of three or more Gs), with consecutive repeats separated by 1 to 7 characters. Example:

AAGGGAAGGGAAAGGGAAGGGAA

I use the following regex, which should match both uppercase and lowercase characters:

$sequence =~ /((G{3,}[ATGC]{1,7}){3,}G{3,})/gi

The pattern should match only sequences with at least four G-repeats. The problem is that I get a positive hit when I match it against the following sequence:

aaagaggaaaaggggaaaaggggaaaaggggaaa

The first candidate repeat in this sequence has only three gs, and they are split by an a (g, a, gg), so it is not a run of three. That leaves only three valid G-repeats, so this sequence should not be matched.

Solution 1: The problem seems to be the /i modifier. I worked around it by dropping /i and spelling out both cases in the character classes:

 $sequence =~ /(([gG]{3,}[aAtTgGcC]{1,7}){3,}[gG]{3,})/g

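Solution 2, provided by ikegami, was described as a negative lookahead. Note that as written below, [?!G] is a bracketed character class matching ?, ! or G, not a (?!...) lookahead: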

$sequence =~ /(([?!G]{3,}[ATGC]{1,7}){3,}[G]{3,})/gi

Thanks @ikegami for the hint and for submitting the bug report.


Solution

  • $ perl -E'say $& while "aaagaggaaaaggggaaaaggggaaaaggggaaa" =~ /((G{3,}[ATGC]{1,7}){3,}G{3,})/gi'
    gggaaaaggggaaaagggg
    

    You've found a bug! I filed a bug report.

    This bug has been around since at least 5.10, and it's present in the latest release (5.24.0).

    Update: Fixed in Perl 5.26, released on 2017-05-30.