Tags: regex, perl, whitespace, regex-greedy, modifier

Perl regex to remove initial all-whitespace lines from a string: why does it work?


The regex s/\A\s*\n// removes every all-whitespace line from the beginning of a string. It leaves everything else alone, including any whitespace that might begin the first visible line. By "visible line," I mean a line that satisfies /\S/. The code below demonstrates this.

But how does it work?

\A anchors the match at the start of the string.

\s* greedily grabs all whitespace. But without the (?s) modifier, it should stop at the end of the first line, should it not? See https://perldoc.perl.org/perlre.

Suppose that without the (?s) modifier it nevertheless "treats the string as a single line". Then I would expect the greedy \s* to grab every whitespace character it sees, including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.

Nevertheless, the code does exactly what I want. Since I can't explain it, it's like a kludge, something that happens to work, discovered through trial and error. What is the reason it works?

#!/usr/bin/env perl 
use strict; use warnings;
print $^V; print "\n";

my @strs=(
    join('',"\n", "\t", ' ', "\n", "\t", ' dogs',),
    join('',
              "\n",
              "\n\t\t\x20",
              "\n\t\t\x20",
    '......so what?',
              "\n\t\t\x20",
    ),
);

my $count=0;
for my $onestring(@strs)
{
    $count++;
    print "\n$count ------------------------------------------\n"; 
    print "|$onestring|\n";
    (my $try1=$onestring)=~s/\A\s*\n//;
    print "|$try1|\n";
}


Solution

  • But how does it work?
    ...
    I would expect the greedy \s* to grab every whitespace character it sees, including linefeeds. So it would pass the linefeed that precedes the "dogs" string, keep grabbing whitespace, run into the "d", and we would never get a match.

    Correct: the \s* at first grabs everything up to the d (in dogs), and with that the match would fail. So the engine backs up, one character at a time, shortening that greedy grab to give the following pattern, here \n, a chance to match.

    And that works! So \s* ends up matching everything up to (the last!) \n, that \n is matched by the following \n in the pattern, and all is well. The whole match is removed and we are left with "\t dogs" (tab, space, dogs), which is printed.

    This is called backtracking; see also perlretut. Backtracking can be suppressed, most notably by possessive quantifiers (like \w++), or by the atomic-group construct (?>...).
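
To see the backtracking for yourself, here is a small sketch that captures what \s* ends up holding on the first test string from the question, and then repeats the match with the possessive \s*+, which is not allowed to give characters back. The helper names (show, greedy_match, possessive_match) are made up for this illustration, and possessive quantifiers assume Perl 5.10 or later.

```perl
#!/usr/bin/env perl
use strict; use warnings;

# Same first test string as in the question: "\n\t \n\t dogs"
my $str = join('', "\n", "\t", ' ', "\n", "\t", ' dogs');

# Hypothetical helper: make newlines and tabs visible when printing.
sub show {
    my ($s) = @_;
    $s =~ s/\n/\\n/g;
    $s =~ s/\t/\\t/g;
    return $s;
}

# Greedy \s* with backtracking: return what \s* holds once \n has matched.
sub greedy_match {
    my ($s) = @_;
    return $s =~ /\A(\s*)\n/ ? $1 : undef;
}

# Possessive \s*+ keeps everything it grabbed, so the trailing \n
# has nothing left to match and the whole pattern fails.
sub possessive_match {
    my ($s) = @_;
    return $s =~ /\A(\s*+)\n/ ? $1 : undef;
}

my $g = greedy_match($str);
my $p = possessive_match($str);
print "greedy \\s* kept:  ", (defined $g ? '"' . show($g) . '"' : 'no match'), "\n";
print "possessive \\s*+: ", (defined $p ? '"' . show($p) . '"' : 'no match'), "\n";
```

With the greedy form, \s* should end up holding "\n\t " (the first line break plus the whitespace-only second line), leaving the second \n for the pattern's literal \n. With the possessive form there should be no match at all, which is the behavior the question expected from plain \s*.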


    But without the (?s) modifier, it should stop at the end of the first line, should it not?

    Here you may be confusing \s with ., which indeed does not match \n (without /s). The \s class matches any whitespace character, \n included, with or without that modifier.
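
A quick way to check the difference between \s and . is to match each against a string that contains only a newline. This is a minimal sketch; the sub names are invented for the example.

```perl
#!/usr/bin/env perl
use strict; use warnings;

# \s is a character class that includes "\n"; /s plays no part in that.
sub s_matches_newline     { return "\n" =~ /\A\s\z/  ? 1 : 0 }

# Without /s, the dot does not match "\n" ...
sub dot_matches_newline   { return "\n" =~ /\A.\z/   ? 1 : 0 }

# ... and with /s it does.
sub dot_s_matches_newline { return "\n" =~ /\A.\z/s  ? 1 : 0 }

print "\\s   matches newline: ", s_matches_newline(),     "\n";
print ".    matches newline: ", dot_matches_newline(),   "\n";
print "./s  matches newline: ", dot_s_matches_newline(), "\n";
```

The /s modifier only changes what . matches, so it has no bearing on the s/\A\s*\n// substitution in the question.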