Why does =~ only evaluate once?


In this example script:

#perl 5.26.1 
$foo = "batcathat";

if ($foo =~ /cat/g) {
    print "yes\n";
} else {
    print "no\n";
}

if ($foo =~ /cat/g) {
    print "yes\n";
} else {
    print "no\n";
}

This will print:

yes
no

The expected output is:

yes
yes

I can confirm by printing the string that it has not been mutated by running the regex match.

Why does Perl seemingly only evaluate a regex expression once? I could find no information about this on Google or manuals, and the behaviour is not intuitive to me. I expect that each time you evaluate a regex match, it starts from fresh, and does not remember anything about the previous match.

Edit: For future context, this question was asked after I found a bit of code looking like this:

while ( $foo =~ /pattern/g) { $some_incrementing_var++ };

I did not understand initially how this while loop could ever terminate, as on first glance it looked like an infinite loop.


Solution

  • The /g in scalar context is one of my favorite tools, and it's documented at unusual length in perlop. That seems a weird place for its docs, but there are flags that affect how the match operator works, and there are flags that affect how the pattern works (see Know the difference between regex and match operator flags).

    @ikegami mentions pos, which you can read about in perlfunc. Perl tracks where it is in a string. Without /g in scalar context, the match operator starts at the beginning of the string and moves toward the end. After it matches, it's done. The next match starts at the beginning of the string again.

    The /g changes this slightly. The first match starts at the beginning of the string, matches (if it can), and sets the position to just after where the match ended. That's the pos. The next time with /g, the match starts at pos. In your case, the first match leaves pos at 6, so the second match starts in the "hat" that remains and cannot find cat.

    In list context, it's doing the same thing, but going to exhaustion. It makes the first match, then tries the next match starting at that position. This is why a match in list context cannot find overlapping matches: it starts matching after where the overlap would have started. Jeffrey Friedl's Mastering Regular Expressions, despite being old, is a good survey of how regexes work, different ways they could work, and how various ways preclude features that other ways might have.
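    As a rough sketch of the difference between the two contexts, the snippet below runs the same pattern in list context and in a scalar-context loop, then uses a lookahead to pick up overlapping candidates. The string "aaaa" and the variable names are made up for illustration. The loop form is also why something like while ( $foo =~ /pattern/g ) terminates: each pass resumes at pos and the condition turns false once no further match is found.

```perl
use v5.26;

my $text = "aaaa";

# List context: every non-overlapping match is returned at once.
my @pairs = $text =~ /aa/g;

# Scalar context in a loop: one match per iteration, resuming at pos($text).
my @starts;
while ($text =~ /aa/g) {
    push @starts, $-[0];    # $-[0] is the start offset of the last match
}

# A lookahead consumes no characters, so overlapping candidates are seen too.
my @overlapping = $text =~ /(?=(aa))/g;

say scalar(@pairs);         # 2
say "@starts";              # 0 2
say scalar(@overlapping);   # 3
```

    The lookahead version is only one way to get overlapping matches; assigning to pos inside the loop is another.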

    use v5.26;
    
    my $foo = "batcathat";
    
    say "pos is ", pos($foo) // 0; # not started, so undef
    
    if ($foo =~ /cat/g) {
        print "yes\n";
    } else {
        print "no\n";
    }
    
    say "pos is ", pos($foo) // 0;  # now is 6
    
    if ($foo =~ /cat/g) {
        print "yes\n";
    } else {
        print "no\n";
    }
    

    This outputs:

    pos is 0
    yes
    pos is 6
    no
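    If the goal is the "yes, yes" output from the question, two simple options are to drop the /g or to rewind the position by hand, since pos can be assigned to. This is a minimal sketch with illustrative variable names, not the only way to do it:

```perl
use v5.26;

my $foo = "batcathat";

$foo =~ /cat/g;                  # first match leaves pos($foo) at 6
my $after_first = pos($foo);

# Option 1: without /g the match starts at the beginning of the string.
my $plain = $foo =~ /cat/ ? "yes" : "no";

# Option 2: keep /g, but rewind by assigning to pos() first.
pos($foo) = 0;
my $rewound = $foo =~ /cat/g ? "yes" : "no";

say "pos after first match: $after_first";
say "without /g: $plain";
say "after rewinding: $rewound";
```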
    

    This position is tracked per string and is reset after a failed match with /g (just like other regex side effect variables):

    use v5.26;
    
    my $foo = "batcathat";
    
    say "pos is ", pos($foo) // 0; # not started, so undef
    
    if ($foo =~ /cat/g) {
        print "yes\n";
    } else {
        print "no\n";
    }
    
    $foo =~ /dog/g;
    
    say "pos is ", pos($foo) // 0;  # undef again (printed as 0) after failed match
    
    if ($foo =~ /cat/g) {
        print "yes\n";
    } else {
        print "no\n";
    }
    

    The output shows that both attempts to find cat work because the dog attempt failed and reset pos:

    pos is 0
    yes
    pos is 0
    yes
    

    But, there's a way to get around that too. The /c flag tells the match operator to not reset pos on failure:

    use v5.26;
    
    my $foo = "batcathat";
    
    say "pos is ", pos($foo) // 0; # not started, so undef
    
    if ($foo =~ /cat/g) {
        print "yes\n";
    } else {
        print "no\n";
    }
    
    $foo =~ /dog/gc; # will not reset pos
    
    say "pos is ", pos($foo) // 0;  # now is 6 because /c
    
    if ($foo =~ /cat/g) {
        print "yes\n";
    } else {
        print "no\n";
    }
    

    Now you are back to the original output:

    pos is 0
    yes
    pos is 6
    no
    

    This allows you to do things like this very simple example. You can match for a certain thing, and if that doesn't work out, try something else without losing your place in the string:

    use v5.26;
    
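    # Placeholder handlers (hypothetical) so the example runs as written.
    sub do_mat_things  { say "mat" }
    sub do_chat_things { say "chat" }
    sub do_cat_things  { say "cat" }
    
    my $foo = "batcathat";
    
    $foo =~ /bat/g;  # pos($foo) is now 3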
    
    if( $foo =~ /\Gmat/gc ) {
        do_mat_things();
    } elsif( $foo =~ /\Gchat/gc ) {
        do_chat_things();
    } elsif( $foo =~ /\Gcat/gc ) {
        do_cat_things();
    }
    

    This allows you to walk through the string in very complex situations and do things in the middle of matching. I think I have some examples in Mastering Perl.
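
    As a rough illustration of that kind of walk, here is a small tokenizer built on \G and /gc. The tokenize function, the token labels, and the input are all hypothetical; it is a sketch of the technique rather than code from perlop or the book.

```perl
use v5.26;

# Hypothetical tokenizer: the sub name and token classes are illustrative only.
sub tokenize {
    my ($input) = @_;
    my @tokens;

    pos($input) = 0;    # start scanning at the beginning
    while (pos($input) < length $input) {
        # Each alternative is anchored at pos with \G; /c keeps pos on failure.
        if    ($input =~ /\G\s+/gc)            { next }    # skip whitespace
        elsif ($input =~ /\G(\d+)/gc)          { push @tokens, "NUM:$1" }
        elsif ($input =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, "ID:$1" }
        elsif ($input =~ /\G([-+*\/=])/gc)     { push @tokens, "OP:$1" }
        else {
            die "Unexpected character at offset " . pos($input) . "\n";
        }
    }
    return @tokens;
}

say join " ", tokenize("x = 42 + y1");    # ID:x OP:= NUM:42 OP:+ ID:y1
```

    Without the /c, the first alternative that fails would reset pos, and the later alternatives would no longer be tried at the current place in the string.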