regex, perl, regex-greedy, negative-lookbehind

perl regex stop negative look-behind taking away from next greedy capture


Take this simple example in perl v5.22.0:

my $data = "foobar\n";
$data =~ s/(?<!bar)(\s*)$/qux$1/;
print $data;

It prints:

foobar
qux

but I didn't expect $data to change. I also tried some earlier versions of perl 5.x with the same result.

Conversely, I'd expect this string with the same regex to cause a replacement but it doesn't:

my $data = "foobaz\n";
$data =~ s/(?<!bar)(\s*)$/qux$1/;
print $data;


I don't understand why this happens. In both cases the asterisk is supposed to be greedy. I figured $1 would be \n, so the negative look-behind would be compared against bar in the first example and baz in the second. With the Perl flavor selected, Regex101 says:

Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed.

So is what happens in this case that the quantifier gives back to the negative look-behind?

As the title says, the real issue is that I'd like to stop the look-behind from swallowing that second group. Unfortunately the real pattern is not a single letter; that is just for the example, to make it easier to understand. Also, in perl I'm somewhat limited in what I can do with the negative look-behind, for example "Variable length lookbehind not implemented in regex". If possible, I'd like an answer that is compatible with perl 5.8. Thanks


Solution

  • I think you want

    $data =~ s/(?<!bar)(?<!\s)(\s*)$/qux$1/;
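
As a quick check of the double look-behind, the sketch below runs it against both strings from the question. The helper name `add_qux` is made up for this example and is not part of the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): applies the substitution
# with both look-behinds and returns the resulting string.
sub add_qux {
    my ($data) = @_;
    $data =~ s/(?<!bar)(?<!\s)(\s*)$/qux$1/;
    return $data;
}

my $unchanged = add_qux("foobar\n");   # "bar" precedes the newline: no match
my $changed   = add_qux("foobaz\n");   # matches just before the newline

print $unchanged;   # foobar
print $changed;     # foobazqux
```

The extra (?<!\s) rules out every position that follows whitespace, so the match can only start at the beginning of the trailing whitespace run, where (?<!bar) sees the text that actually precedes it.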
    

    The following version will work with 5.8, and I think it's actually faster (since it jumps to the end of the string and backtracks, rather than checking two look-behinds at every position):

    $data =~ s/
       ^
       (
          (?:
             .*
             (?: [^r\s]
             |   [^a] r
             |   [^b] ar
             )
          )?
       )
       ( \s* )
       \z
    /${1}qux$2/sx;
    

    ($ could be used instead of \z; it's just a micro-optimization.)
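
To try the anchored version on a few inputs, it can be wrapped in a small subroutine. This is only a sketch: the name `add_qux_anchored` and the third test string are assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the anchored substitution shown above.
sub add_qux_anchored {
    my ($data) = @_;
    $data =~ s/
       ^
       (
          (?:
             .*
             (?: [^r\s]       # last character is neither "r" nor whitespace
             |   [^a] r       # ends in "r" not preceded by "a"
             |   [^b] ar      # ends in "ar" not preceded by "b"
             )
          )?
       )
       ( \s* )
       \z
    /${1}qux$2/sx;
    return $data;
}

my @results = map { add_qux_anchored($_) } ("foobar\n", "foobaz\n", "foobaz \n");
print for @results;
```

With these inputs the first string comes back unchanged, while the other two get qux inserted in front of their trailing whitespace.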


    Explanation

    Without the m flag, $ is equivalent to (?:\n?\z), which is to say it matches before a newline at the end of the string and at the very end of the string. This means there are two possible places for $ to match in foobar␊:

    foobar␊      (There's a LF at position 6 in
    01234567      case your font can't show it.)
          ^^
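
The difference between $ and \z can be seen with a minimal sketch like the following (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $data = "foobar\n";

# Without /m, $ can match just before a final newline or at the very end.
my $dollar_matches = ($data =~ /bar$/)  ? 1 : 0;

# \z matches only at the very end, so the trailing newline gets in the way.
my $z_matches      = ($data =~ /bar\z/) ? 1 : 0;

print "bar\$  matches: $dollar_matches\n";   # 1
print "bar\\z matches: $z_matches\n";        # 0
```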
    

    (?<!bar) prevents the first location from being considered, but it allows the second.

    • (?<!bar)(\s*)$ matches 0 characters at position 7, because

      • (?<!bar) matches 0 characters at position 7 (the three characters before it are ar␊, not bar).
      • (\s*) matches 0 characters at position 7.
      • $ matches 0 characters at position 7.

    It's the only possible match, so greediness is not relevant.
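
The match location described above can be confirmed with the @- array, which holds the start offset of the last successful match. This is a small sketch using the string and pattern from the question; the variable names are made up.

```perl
use strict;
use warnings;

my $data = "foobar\n";
my ($start, $captured_length);

if ($data =~ /(?<!bar)(\s*)$/) {
    $start           = $-[0];        # offset where the whole match begins
    $captured_length = length $1;    # number of characters (\s*) captured
}

print "match starts at $start, \$1 has length $captured_length\n";
# match starts at 7, $1 has length 0
```

So qux is inserted after the newline, which is why the output in the question shows it on a line of its own.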