perl hashtable regex-lookarounds regex-group

How can I partition a line into code and comment using a single regex in Perl?


I want to read through a text file and partition each line into the following three variables. Each variable must be defined, although it might be equal to the empty string.

  • $a1code: all characters up to and not including the first non-escaped percent sign. If there is no non-escaped percent sign, this is the entire line. As we see in the example below, this also could be the empty string in a line where the following two variables are non-empty.
  • $a2boundary: the first non-escaped percent sign, if there is one.
  • $a3cmnt: any characters after the first non-escaped percent sign, if there is one.

The script below accomplishes this but requires several lines of code, two hashes, and a composite regex, that is, two regexes combined with |. The composite seems necessary because the first clause,

(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)

does not match a line that is pure code with no comment. Is there a more elegant way, using a single regex and fewer steps? In particular, is there a way to dispense with the %match hash and somehow fill the %+ hash with all three variables in a single step?

#!/usr/bin/env perl
use strict; use warnings;
print join('', 'perl ', $^V, "\n",);
use Data::Dumper qw(Dumper); $Data::Dumper::Sortkeys = 1;

my $count=0;
while(<DATA>)
{
    $count++;
    print "$count\t";
    chomp;
    my %match=(
        a2boundary=>'',
        a3cmnt=>'',
    );
    print "|$_|\n";
    if($_=~/^(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)|(?<a1code>.*)/)
    {
        print "from regex:\n";
        print Dumper \%+;
        %match=(%match,%+,);
    }
    else
    {
        die "no match? coding error, should never get here";
    }
    if(scalar keys %+ != scalar keys %match)
    {
        print "from multiple lines of code:\n";
        print Dumper \%match;
    }
    print "------------------------------------------\n";
}

__DATA__
This is 100\% text and below you find an empty line.

abba 5\% %comment 9\% %Borgia
%all comment
%

Result:

perl v5.34.0
1   |This is 100\% text and below you find an empty line.   |
from regex:
$VAR1 = {
          'a1code' => 'This is 100\\% text and below you find an empty line.   '
        };
from multiple lines of code:
$VAR1 = {
          'a1code' => 'This is 100\\% text and below you find an empty line.   ',
          'a2boundary' => '',
          'a3cmnt' => ''
        };
------------------------------------------
2   ||
from regex:
$VAR1 = {
          'a1code' => ''
        };
from multiple lines of code:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '',
          'a3cmnt' => ''
        };
------------------------------------------
3   |abba 5\% %comment 9\% %Borgia|
from regex:
$VAR1 = {
          'a1code' => 'abba 5\\% ',
          'a2boundary' => '%',
          'a3cmnt' => 'comment 9\\% %Borgia'
        };
------------------------------------------
4   |%all comment|
from regex:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '%',
          'a3cmnt' => 'all comment'
        };
------------------------------------------
5   |%|
from regex:
$VAR1 = {
          'a1code' => '',
          'a2boundary' => '%',
          'a3cmnt' => ''
        };
------------------------------------------

Solution

  • What about cases like this\\%string where the backslash before the percent sign is itself escaped?

    Consider something like this, which, instead of trying to use a regular expression to split the string into three groups, uses one to find where the string should be split, and substr to do the actual splitting:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Data::Dumper;
    
    sub splitter {
        my $line = shift;
        if ($line =~ /
           # Match either
           (?<!\\)% # A % not preceded by a backslash    
           | # or                    
           (?<!\\)(?:\\\\)+\K% # An even number of backslashes, also at the start of the line, followed by a %
                     /x) {
            return (substr($line, 0, $-[0]), '%', substr($line, $+[0]));        
        } else {
            return ($line, '', '');
        }
    }
    
    while (<DATA>) {
        chomp;
        # Assign to an array instead of individual scalars for demonstration purposes
        my @vals = splitter $_;
        print Dumper(\@vals);
    }   
    
    __DATA__
    This is 100\% text and below you find an empty line.
    
    abba 5\% %comment 9\% %Borgia
    %all comment
    %
    a tricky\\%test % case
    another \\\%one % to mess with you
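
On the original question of filling %+ with all three names in one step, here is one possible sketch. It is not part of the answer above, and the helper name partition_line and the sample lines are only illustrative. The idea is to consume the code part as a run of "plain character, or backslash plus the character it escapes", then make the boundary an optional % and the comment a possibly empty rest. Because all three named groups take part in every match, each entry of %+ is defined, possibly as the empty string. It assumes that a backslash always escapes the character that follows it, so an even run of backslashes leaves the % unescaped.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper, not from the original post: one regex whose three
# named groups all take part in every match, so each is defined afterwards.
sub partition_line {
    my ($line) = @_;
    $line =~ /
        \A
        (?<a1code>     (?: [^\\%] | \\.? )* )  # plain chars, or a backslash plus the char it escapes
        (?<a2boundary> %? )                    # the first unescaped percent sign, if any
        (?<a3cmnt>     .* )                    # whatever follows the boundary
        \z
    /xs or die "no match, which should be impossible";
    return ($+{a1code}, $+{a2boundary}, $+{a3cmnt});
}

# Sample lines in a non-interpolating heredoc, so backslashes stay literal.
my @lines = split /\n/, <<'END';
This is 100\% text and below you find an empty line.

abba 5\% %comment 9\% %Borgia
%all comment
%
a tricky\\%test % case
another \\\%one % to mess with you
END

printf "code=[%s] boundary=[%s] comment=[%s]\n", partition_line($_) for @lines;
```

Since nothing after the code group can fail, the regex engine never has to backtrack into it, and no lookbehind is needed. Whether this counts as more elegant than locating the split point and using substr is a matter of taste.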