I want to read through a text file and partition each line into the following three variables. Each variable must be defined, although it might be equal to the empty string.
$a1code
: all characters up to and not including the first non-escaped percent sign. If there is no non-escaped percent sign, this is the entire line. As we see in the example below, this also could be the empty string in a line where the following two variables are non-empty.$a2boundary
: the first non-escaped percent sign, if there is one.$a3cmnt
: any characters after the first non-escaped percent sign, if there is one.The script below accomplishes this but requires several lines of code, two hashes, and a composite regex, that is, 2 regex combined by |
.
The composite seems necessary because the first clause,
(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)
does not match a line that is pure code, no comment.
Is there a more elegant way, using a single regex and fewer steps?
In particular, is there a way to dispense with the %match
hash and somehow
fill the %+
hash with all three three variables in a single step?
#!/usr/bin/env perl
use strict; use warnings;
print join('', 'perl ', $^V, "\n",);
use Data::Dumper qw(Dumper); $Data::Dumper::Sortkeys = 1;
my $count=0;
while(<DATA>)
{
$count++;
print "$count\t";
chomp;
my %match=(
a2boundary=>'',
a3cmnt=>'',
);
print "|$_|\n";
if($_=~/^(?<a1code>.*?)(?<a2boundary>(?<!\\)%)(?<a3cmnt>.*)|(?<a1code>.*)/)
{
print "from regex:\n";
print Dumper \%+;
%match=(%match,%+,);
}
else
{
die "no match? coding error, should never get here";
}
if(scalar keys %+ != scalar keys %match)
{
print "from multiple lines of code:\n";
print Dumper \%match;
}
print "------------------------------------------\n";
}
__DATA__
This is 100\% text and below you find an empty line.
abba 5\% %comment 9\% %Borgia
%all comment
%
Result:
perl v5.34.0
1 |This is 100\% text and below you find an empty line. |
from regex:
$VAR1 = {
'a1code' => 'This is 100\\% text and below you find an empty line. '
};
from multiple lines of code:
$VAR1 = {
'a1code' => 'This is 100\\% text and below you find an empty line. ',
'a2boundary' => '',
'a3cmnt' => ''
};
------------------------------------------
2 ||
from regex:
$VAR1 = {
'a1code' => ''
};
from multiple lines of code:
$VAR1 = {
'a1code' => '',
'a2boundary' => '',
'a3cmnt' => ''
};
------------------------------------------
3 |abba 5\% %comment 9\% %Borgia|
from regex:
$VAR1 = {
'a1code' => 'abba 5\\% ',
'a2boundary' => '%',
'a3cmnt' => 'comment 9\\% %Borgia'
};
------------------------------------------
4 |%all comment|
from regex:
$VAR1 = {
'a1code' => '',
'a2boundary' => '%',
'a3cmnt' => 'all comment'
};
------------------------------------------
5 |%|
from regex:
$VAR1 = {
'a1code' => '',
'a2boundary' => '%',
'a3cmnt' => ''
};
------------------------------------------
What about cases like this\\%string
where the backslash before the percent sign is itself escaped?
Consider something like this, which instead of trying to use a regular expression to split the string into three groups, uses one to look where for it should be split, and substr
to do the actual splitting:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
sub splitter {
my $line = shift;
if ($line =~ /
# Match either
(?<!\\)% # A % not preceded by a backslash
| # or
(?<=[^\\])(?:\\\\)+\K% # Any even number of backslashes followed by a %
/x) {
return (substr($line, 0, $-[0]), '%', substr($line, $+[0]));
} else {
return ($line, '', '');
}
}
while (<DATA>) {
chomp;
# Assign to an array instead of individual scalars for demonstration purposes
my @vals = splitter $_;
print Dumper(\@vals);
}
__DATA__
This is 100\% text and below you find an empty line.
abba 5\% %comment 9\% %Borgia
%all comment
%
a tricky\\%test % case
another \\\%one % to mess with you