regex perl regex-greedy negative-lookbehind

perl regex stop negative look-behind taking away from next greedy capture

Take this simple example in perl v5.22.0:

my $data = "foobar\n";
$data =~ s/(?<!bar)(\s*)$/qux$1/;
print $data;

It prints:

foobar
qux

but I didn't expect $data to change. I also tried some earlier versions of perl 5.x with the same result.

Conversely, I'd expect this string with the same regex to cause a replacement but it doesn't:

my $data = "foobaz\n";
$data =~ s/(?<!bar)(\s*)$/qux$1/;
print $data;

I don't understand why this happens. In either one the asterisk is supposed to be greedy. I figured $1 would be \n making the negative look-behind group compare against bar in the first example and baz in the second example. Regex101 when I use perl says:

Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed.

So in this case, is what happens that it gives back to the negative look-behind?

As the title says the real issue is I'd like to stop the look-behind from swallowing that second group. Unfortunately it's not a single letter, that is just for the example to make it easier to understand. Also in perl I'm somewhat limited with what I can do with the negative look-behind, for example "Variable length lookbehind not implemented in regex". If it's possible I'd like an answer that is compatible with perl 5.8. Thanks

Solution

I think you want

$data =~ s/(?<!bar)(?<!\s)(\s*)$/qux$1/;
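As a quick check of the substitution above, here is a minimal sketch that runs it over a few inputs. The helper name add_qux and the extra test strings are my own, for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: inserts "qux" before any trailing whitespace,
# unless the text before that whitespace ends in "bar".
sub add_qux {
    my ($data) = @_;
    # (?<!\s) stops the match from starting in the middle of the
    # trailing whitespace, which is what let "foobar\n" match before.
    $data =~ s/(?<!bar)(?<!\s)(\s*)$/qux$1/;
    return $data;
}

for my $input ("foobar\n", "foobaz\n", "foobaz  \n") {
    my $output = add_qux($input);
    # Show newlines as a literal \n so each result fits on one line.
    (my $shown = $output) =~ s/\n/\\n/g;
    print "$shown\n";
}
```

With these inputs, the first string should come back unchanged, and the other two should get qux inserted right after foobaz.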

The following version also works with 5.8, and I think it's actually faster (since it jumps to the end of the string and backtracks rather than checking two look-behinds at every position):

$data =~ s/
   ^
   (
      (?:
         .*
         (?: [^r\s]
         |   [^a] r
         |   [^b] ar
         )
      )?
   )
   ( \s* )
   \z
/${1}qux$2/sx;

($ could be used instead of \z; using \z is just a micro-optimization.)

Explanation

Without the m flag, $ is equivalent to (?=\n?\z), which is to say it matches before a newline at the end of the string and at the very end of the string. This means there are two possible places for $ to match in foobar␊:

foobar␊      (There's a LF at position 6 in
01234567      case your font can't show it.)
      ^^

(?<!bar) prevents the first location from being considered, but it allows the second.
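The two candidate positions can be listed with a small script. This is only a sketch: the variable names are mine, and it relies on m//g moving on after a zero-length match.

```perl
use strict;
use warnings;

my $data = "foobar\n";

# Collect every offset at which a bare $ can match.
my @dollar_offsets;
while ($data =~ /$/g) {
    push @dollar_offsets, pos($data);
}
print "\$ matches at: @dollar_offsets\n";

# Same scan, but with the negative look-behind in front of $.
my @allowed_offsets;
while ($data =~ /(?<!bar)$/g) {
    push @allowed_offsets, pos($data);
}
print "(?<!bar)\$ matches at: @allowed_offsets\n";
```

The first line should list offsets 6 and 7, and the second only offset 7, since the look-behind rules out the position right after bar.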

(?<!bar)(\s*)$ matches 0 characters at position 7, because
- (?<!bar) matches 0 characters at position 7.
- (\s*) matches 0 characters at position 7.
- $ matches 0 characters at position 7.

It's the only possible match, so greediness is not relevant.
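To see this for yourself, the built-in @- and @+ arrays report where a match starts and ends. A minimal sketch using the regex from the question (the variable names are mine):

```perl
use strict;
use warnings;

my $data = "foobar\n";
my ($start, $end, $captured);

if ($data =~ /(?<!bar)(\s*)$/) {
    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past its end.
    ($start, $end, $captured) = ($-[0], $+[0], $1);
    printf "match spans %d..%d, \$1 has length %d\n",
        $start, $end, length $captured;
}
```

The match should start and end at offset 7 with an empty $1, which is why qux lands after the newline.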