
How to redefine \s to match underscores?


In Perl before v5.18, the regular expression character class \s (whitespace) is the same as [\t\n\f\r ].

Some filenames use underscores in place of spaces, so I was wondering whether it's possible to redefine \s (locally) to match underscores in addition to whitespace.

This would be purely for readability, since the regular expressions are already convoluted and contain many [\s_] classes. Can I do this? If so, how?


Solution

  • Whenever I think that something is impossible in Perl, it usually turns out that I am wrong. And sometimes when I think that something is very difficult in Perl, I am wrong, too. @sln put me on the right track.

    Let's not override \s just yet, although you could. For the sake of the heirs of your program who expect \s to mean something specific, instead let's define the sequence \_ to mean "any whitespace character or the _ character" inside a regular expression. The details are in the link above, but the implementation looks like:

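    package myspace;  # redefine  \_  to mean  [\s_]  (save as myspace.pm)
    use overload;
    # Rewrite rules: an escaped backslash is kept as is; \_ becomes the ASCII
    # whitespace class (including vertical tab, as \s does from v5.18) plus _.
    my %rules = ('\\' => '\\\\', '_' => qr/[\t\n\x{0B}\f\r _]/ );
    sub import {
        die if @_ > 1;      # "use myspace" takes no import arguments
        # The handler receives the source text of each constant piece of a
        # regex compiled in the scope of the "use myspace" statement.
        overload::constant 'qr' => sub {
            my $re = shift;
            # \\ is consumed as a pair first, so \\_ stays a literal \_
            $re =~ s{\\(\\|_)}{$rules{$1}}gse;
            return $re;
        };
    }
    1;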
    

    Now in your script, say

    use myspace;
    

    and now \_ in a regular expression means [\s_].
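
    To make the rewriting step easier to see, here is a minimal sketch that applies the same substitution to ordinary strings and prints the result. The helper name rewrite_pattern is hypothetical, and the sketch substitutes the literal text [\s_] rather than the qr// object used in the module, purely to keep the printed output readable.

```perl
use strict;
use warnings;

# Same rules as the module, but with a plain-text replacement (an assumption
# made for readability; the module above interpolates a qr// object instead).
my %rules = ('\\' => '\\\\', '_' => '[\s_]');

# Hypothetical helper: the substitution the handler performs, as a function.
sub rewrite_pattern {
    my ($re) = @_;
    $re =~ s{\\(\\|_)}{$rules{$1}}ge;
    return $re;
}

print rewrite_pattern('aaa\_.*txt'), "\n";   # aaa[\s_].*txt
print rewrite_pattern('\\\\_'), "\n";        # \\_   (escaped backslash, then _)
print rewrite_pattern('a_b'), "\n";          # a_b   (bare underscore untouched)
```

    Because the alternation consumes \\ as a pair before it can see \_, an escaped backslash followed by an underscore passes through unchanged.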

    Demo:

    use myspace;
    while (<DATA>) {
        chomp;
        if ($_ =~ /aaa\s.*txt/) {      # match whitespace
            print "match[1]: $_\n";
        }
        if ($_ =~ /aaa\_.*txt/) {      # match [\s_]
            print "match[2]: $_\n";
        }
        if ($_ =~ /\\_/) {             # match literal  '\_'
            print "match[3]: $_\n";
        }
    }
    __DATA__
    aaabbb.txt
    aaa\_ccc.txt
    cccaaa bbb.txt
    aaa_bbb.txt
    

    Output:

    match[3]: aaa\_ccc.txt
    match[1]: cccaaa bbb.txt
    match[2]: cccaaa bbb.txt
    match[2]: aaa_bbb.txt
    

    The third case demonstrates that \\_ in a regular expression still matches a literal \_, just as \\s matches a literal \s.
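
    Since the question asks for a local redefinition, the following single-file sketch illustrates the scoping. It relies on overload::constant handlers being lexically scoped, so the rewrite should apply only to regexes compiled inside the block that contains the use statement. The package name myspace_inline and the two helper subs are hypothetical; the package is defined in a BEGIN block only so that the example runs without a separate .pm file.

```perl
use strict;
use warnings;

# Hypothetical single-file stand-in for the myspace module.
BEGIN {
    package myspace_inline;
    use overload ();
    my %rules = ('\\' => '\\\\', '_' => '[\s_]');
    sub import {
        overload::constant 'qr' => sub {
            my $re = shift;
            $re =~ s{\\(\\|_)}{$rules{$1}}ge;
            return $re;
        };
    }
    $INC{'myspace_inline.pm'} = 1;   # tell "use" the module is already loaded
}

# Outside the scope of "use myspace_inline", \_ is just an escaped underscore.
sub plain_match { return $_[0] =~ /aaa\_bbb/ ? 1 : 0 }

{
    use myspace_inline;              # in effect only until the closing brace
    sub extended_match { return $_[0] =~ /aaa\_bbb/ ? 1 : 0 }
}

for my $name ('aaa_bbb', 'aaa bbb') {
    printf "'%s': plain=%d extended=%d\n",
        $name, plain_match($name), extended_match($name);
}
```

    Only the regex compiled inside the braces treats \_ as [\s_]; the identical pattern outside the block keeps its ordinary meaning, which leaves the rest of the program unaffected.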