
Perl: Method to convert regexp with greedy quantifiers to non-greedy


My user supplies a regexp whose quantifiers are greedy by default. The user can supply any valid regexp, so the solution has to handle anything the user can throw at it.

How do I convert the regexp so any greedy quantifier will be non-greedy?

Does Perl have a (?...:regexp) construct that forces the greedy default for quantifiers into a non-greedy one?

If not: Is there a different way I can force a regexp with greedy quantifiers into a non-greedy one?

E.g., a user may enter:

.*
[.*]
[.*]{4,10}
[.*{4,10}]{4,10}

While these four examples may look similar, they have completely different meanings.

If you simply add ? after every * or }, you will change the character sets in the last three examples.
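To make the problem concrete, here is a minimal sketch of that naive substitution. The helper name `naive_ungreedy` is made up for illustration; it blindly appends ? after every * or } and so rewrites the contents of character classes as well.

```perl
use strict;
use warnings;

# Hypothetical helper: the naive approach, with no awareness of
# character classes or escapes.
sub naive_ungreedy {
    my ($regex) = @_;
    (my $out = $regex) =~ s/([*}])/$1?/g;
    return $out;
}

for my $regex ('.*', '[.*]', '[.*]{4,10}', '[.*{4,10}]{4,10}') {
    printf "%-18s -> %s\n", $regex, naive_ungreedy($regex);
}

# The class [.*] matches only "." or "*"; the naive result [.*?]
# additionally matches a literal "?".
print "original class matches '?': ", ('?' =~ /\A[.*]\z/  ? "yes" : "no"), "\n";
print "naive class matches '?':    ", ('?' =~ /\A[.*?]\z/ ? "yes" : "no"), "\n";
```

Only the first pattern survives intact; in the other three the ? lands inside the brackets and becomes one more character the class accepts.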

Instead they should be changed to/behave like:

.*?
[.*]
[.*]{4,10}?
[.*{4,10}]{4,10}?

but the matched string should be the minimal match, not the leftmost match that Perl returns by default:

$a="aab";

$a=~/(a.*?b)$/;
# Matches aab, not ab
print $1;

But given the non-greedy regexp, the minimal match can probably be obtained by prepending .*:

$a="aab";

$a=~/.*(a.*?b)$/;
# Matches ab
print $1;
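The .* prefix makes Perl prefer the match that starts furthest to the right. That coincides with the shortest match in the example above, but not in general. A hedged sketch of an alternative is to try the non-greedy pattern at every start offset and keep the shortest result. The helper name `shortest_match` is hypothetical, and the sketch assumes that the lazy match found at a given offset is the shortest one starting there, which holds for simple patterns but not necessarily for alternations.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): try the pattern at
# every start offset and keep the shortest match found.
sub shortest_match {
    my ($pattern, $string) = @_;
    my $best;
    for my $start (0 .. length($string)) {
        pos($string) = $start;
        # \G pins the attempt to the current pos(), i.e. to $start.
        next unless $string =~ /\G(?:$pattern)/g;
        my $match = substr($string, $-[0], $+[0] - $-[0]);
        $best = $match if !defined $best || length($match) < length($best);
    }
    return $best;    # undef when the pattern matches nowhere
}

my $text = "ab aXXb";
my ($rightmost) = $text =~ /.*(a.*?b)/;
print "prefix trick:   $rightmost\n";
print "shortest match: ", shortest_match('a.*?b', $text), "\n";
```

On "ab aXXb" the prefix trick captures aXXb because that match starts last, while the scan returns ab.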

Solution

  • You can use a state machine:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my @regexes = ( ".*", "[.*]", "[.*]{4,10}", "[.*{4,10}]{4,10}" );
    
    for (@regexes) {
        print "give: $_\n";
        my $ungreedy = make_ungreedy($_);
        print "got:  $ungreedy\n";
        print "============================================\n";
    }
    
    
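    # Walk the regexp one character at a time, tracking escapes and
    # character classes, and append "?" after each quantifier found
    # outside a class. Note: any unescaped "}" outside a class is treated
    # as the end of a quantifier, and the "?" quantifier itself is left alone.
    sub make_ungreedy {
        my $regex = shift;
    
        my $class_state  = 0;    # inside a [...] character class
        my $escape_state = 0;    # previous character was a backslash
        my $found        = 0;    # previous character ended a quantifier
        my $ungreedy     = "";
    
        for (split (//, $regex)) {
            if ($found) {
                # Add the lazy modifier unless the quantifier already has one.
                $ungreedy .= "?" unless (/\?/);
                $found = 0;
            }
            $ungreedy .= $_;
    
            # An escaped character is copied as-is and never starts a class
            # or counts as a quantifier.
            $escape_state = 0, next if ($escape_state);
            $escape_state = 1, next if (/\\/);
            $class_state  = 1, next if (/\[/);
            if ($class_state) {
                # Inside a class nothing is a quantifier; wait for "]".
                $class_state = 0 if (/\]/);
                next;
            }
            $found = 1 if (/[*}+]/);
        }
        # A quantifier at the very end of the regexp still needs its "?".
        $ungreedy .= '?' if $found;
        return $ungreedy;
    }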
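The character-by-character version treats every unescaped } as a quantifier (so \x{41} gains a stray ?) and does not convert the ? quantifier. A possible variant, sketched below under stated assumptions, reads the regexp token by token instead. The name `make_ungreedy_tokens` is hypothetical. The sketch assumes the regexp is not written with /x comments and does not contain (?#...) comments, (?{...}) code blocks, or \Q...\E spans. It also leaves possessive quantifiers such as a++ untouched.

```perl
use strict;
use warnings;

# Hypothetical token-based variant of make_ungreedy (a sketch, not a
# full regexp parser).
sub make_ungreedy_tokens {
    my ($regex) = @_;
    my $out = '';
    pos($regex) = 0;
    while (pos($regex) < length $regex) {
        if (   $regex =~ /\G(\\[xoNpP]\{[^}]*\}|\\.)/gcs                     # escape, incl. \x{..}, \p{..}
            || $regex =~ /\G(\[\^?\]?(?:\\.|\[:\^?\w+:\]|[^\]\\])*\])/gcs    # whole [...] class
            || $regex =~ /\G(\(\?)/gc)                                       # "(?" group opener
        {
            $out .= $1;    # copy unchanged; nothing in here is a quantifier
        }
        elsif ($regex =~ /\G([*+?]|\{\d+(?:,\d*)?\})([?+]?)/gc) {
            # A real quantifier: add "?" only if it has no modifier yet.
            $out .= $1 . ($2 eq '' ? '?' : $2);
        }
        else {
            $regex =~ /\G(.)/gcs;    # any other single character
            $out .= $1;
        }
    }
    return $out;
}

for my $regex ('.*', '[.*]', '[.*]{4,10}', '[.*{4,10}]{4,10}', 'a?', '\x{41}+') {
    printf "%-18s -> %s\n", $regex, make_ungreedy_tokens($regex);
}
```

Consuming "(?" as a single token keeps the ? of (?:...) from being mistaken for a quantifier, and matching a whole character class at once removes the need for a separate class state.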