
Differences between empty and blank lines in regexps


There are already several good discussions of regular expressions and empty lines on SO. I'll remove this question if it is a duplicate.

Can anyone explain why this script outputs 5 3 4 5 4 3 instead of 4 3 4 4 4 3? When I run it in the debugger, $blank and $classyblank stay at "4" (which I assume is the correct value) until just before the print statement.

my ( $blank, $nonblank, $non_nonblank, 
     $classyblank,  $classyspace, $blanketyblank ) = 0 ;

while (<DATA>) {

  $blank++ if /\p{IsBlank}/         ; # POSIXly blank - 4?
  $nonblank++ if /^\P{IsBlank}$/    ; # POSIXly non-blank - 3
  $non_nonblank++ if not /\S/       ; # perlishly not non-blank - 4
  $classyblank++ if /[[:blank:]]/   ; # older(?) charclass blankness - 4?
  $classyspace++ if /^[[:space:]]$/ ; # older(?) charclass whitespace - 4
  $blanketyblank++ if /^$/          ; # perlishly *really empty*  - 3

}

print join " ", $blank, $nonblank, $non_nonblank,
            $classyblank, $classyspace, $blanketyblank , "\n" ;

__DATA__

line above only has a linefeed this one is not blank because: words

this line is followed by a line with white space (you may need to add it)

then another blank line following this one

THE END :-\

Is it something to do with the __DATA__ section or am I misunderstanding POSIX regular expressions?


ps:

As noted in a comment on a timely post elsewhere, "really empty" (/^$/) can miss non-emptiness:

perl -E 'my $string = "\n" . "foo\n\n" ; say "empty" if $string =~ /^$/ ;'
perl -E 'my $string = "\n" . "bar\n\n" ; say "empty" if $string =~ /\A\z/ ;'
perl -E 'my $string = "\n" . "baz\n\n" ; say "empty" if $string =~ /\S/ ;' 
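
To make the difference between these anchors easier to see, here is a small sketch that is not part of the original post. The helper name emptiness is made up for illustration; it reports which of three "empty" tests a string passes. It relies on two documented behaviours: without /m, $ matches at the very end of the string or just before a final newline, while \z matches only at the very end.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): list the "empty" tests a string passes.
sub emptiness {
    my ($s) = @_;
    my @hits;
    push @hits, '^$'   if $s =~ /^$/;     # $ also matches just before a final newline
    push @hits, '^$/m' if $s =~ /^$/m;    # with /m, ^ and $ work per line inside the string
    push @hits, '\A\z' if $s =~ /\A\z/;   # true only for the zero-length string
    return join ' ', @hits;
}

for my $s ( "", "\n", "\nfoo\n\n" ) {
    ( my $shown = $s ) =~ s/\n/\\n/g;     # make newlines visible in the report
    printf "%-12s passes: %s\n", "\"$shown\"", emptiness($s);
}
```

In this sketch only the zero-length string passes /\A\z/, a lone "\n" still passes /^$/, and "\nfoo\n\n" passes only the /m variant, because its first line is empty.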

Solution

  • /\p{IsBlank}/ doesn't check for an empty string. \p matches a character that has the specified Unicode property.

    $ unichars '\p{IsBlank}' | cat
     ---- U+0009 CHARACTER TABULATION
     ---- U+0020 SPACE
     ---- U+00A0 NO-BREAK SPACE
     ---- U+1680 OGHAM SPACE MARK
     ---- U+2000 EN QUAD
     ---- U+2001 EM QUAD
     ---- U+2002 EN SPACE
     ---- U+2003 EM SPACE
     ---- U+2004 THREE-PER-EM SPACE
     ---- U+2005 FOUR-PER-EM SPACE
     ---- U+2006 SIX-PER-EM SPACE
     ---- U+2007 FIGURE SPACE
     ---- U+2008 PUNCTUATION SPACE
     ---- U+2009 THIN SPACE
     ---- U+200A HAIR SPACE
     ---- U+202F NARROW NO-BREAK SPACE
     ---- U+205F MEDIUM MATHEMATICAL SPACE
     ---- U+3000 IDEOGRAPHIC SPACE
    

    It matches " \n" since SPACE has the IsBlank property.


  • /[[:blank:]]/ doesn't check for an empty string. [...] matches a character that is a member of the specified class.

    $ unichars '[[:blank:]]' | cat
     ---- U+0009 CHARACTER TABULATION
     ---- U+0020 SPACE
     ---- U+00A0 NO-BREAK SPACE
     ---- U+1680 OGHAM SPACE MARK
     ---- U+2000 EN QUAD
     ---- U+2001 EM QUAD
     ---- U+2002 EN SPACE
     ---- U+2003 EM SPACE
     ---- U+2004 THREE-PER-EM SPACE
     ---- U+2005 FOUR-PER-EM SPACE
     ---- U+2006 SIX-PER-EM SPACE
     ---- U+2007 FIGURE SPACE
     ---- U+2008 PUNCTUATION SPACE
     ---- U+2009 THIN SPACE
     ---- U+200A HAIR SPACE
     ---- U+202F NARROW NO-BREAK SPACE
     ---- U+205F MEDIUM MATHEMATICAL SPACE
     ---- U+3000 IDEOGRAPHIC SPACE
    

    It matches " \n" since SPACE is a member of the [:blank:] POSIX character class and thus a member of the [[:blank:]] character class.
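
Both points come down to anchoring: an unanchored /[[:blank:]]/ or /\p{IsBlank}/ succeeds on any line that contains a space or tab anywhere, not only on lines that consist of nothing else. The following is a minimal sketch of one way to separate the cases. The classify helper and its category names are assumptions made for illustration, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): put one input line into a category.
sub classify {
    my ($line) = @_;
    return 'empty'      if $line =~ /\A\n?\z/;              # nothing except an optional newline
    return 'blank-only' if $line =~ /\A[[:blank:]]+\n?\z/;  # anchored: only blanks on the line
    return 'has-blank'  if $line =~ /[[:blank:]]/;          # unanchored: a blank somewhere
    return 'no-blank';
}

# Count a few sample lines per category.
my %count;
$count{ classify($_) }++ for "\n", " \n", "two words\n", "word\n", "\t\n";
print "$_: $count{$_}\n" for sort keys %count;
```

Here "two words\n" lands in has-blank, which is the kind of line the unanchored patterns in the question also count.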