I have an array of strings of the form:
@source = (
"something,something2,third"
,"something,something3 ,third"
,"something,something4"
,"something,something 5" # Note the space in the middle of the word
);
I need a regex that extracts the second of the comma-separated words, BUT without trailing spaces, and puts those second words in an array.
@expected_result = ("something2","something3","something4","something 5");
What is the most readable way of achieving this?
I have 3 possibilities, none of which seems optimal readability-wise:
Pure regex, then capture $1:
@result = map { (/[^,]+,([^,]*[^, ]) *(,|$)/ )[0] } @source;
Split on commas (this is NOT a CSV so no parsing needed), then trim:
@result = map { my @s = split(","); $s[1] =~ s/ *$//; $s[1] } @source;
Put split and trim into nested maps:
@result = map { s/ *$//; $_ } map { (split(","))[1] } @source;
Which one of these is better? Any other even more readable alternative I'm not thinking of?
Use named capture groups and give names to subpatterns with (DEFINE)
to greatly improve readability.
#! /usr/bin/env perl
use strict;
use warnings;
use 5.10.0; # for named capture buffer and (?&...)
my $second_trimmed_field_pattern = qr/
(?&FIRST_FIELD) (?&SEP) (?<f2> (?&SECOND_FIELD))
(?(DEFINE)
# The separator is a comma preceded by optional whitespace.
# NOTE: the format uses simple comma separators, NOT full CSV, so
# we don't have to worry about processing escapes or quoted
# fields.
(?<SEP> \s* ,)
# A field stops matching as soon as it sees a separator
# or end-of-string, so it matches in similar fashion to
# a pattern with a non-greedy quantifier.
(?<FIELD> (?: (?! (?&SEP) | $) .)+ )
# The first field is anchored at start-of-string.
(?<FIRST_FIELD> ^ (?&FIELD))
# The second field looks like any other field. The name
# captures our intent for its use in the main pattern.
(?<SECOND_FIELD> (?&FIELD))
)
/x;
In action:
my @source = (
"something,something2,third"
,"something,something3 ,third"
,"something,something4"
,"something,something 5" # Note the space in the middle of the word
);
for (@source) {
if (/$second_trimmed_field_pattern/) {
print "[$+{f2}]\n";
#print "[$1]\n"; # or do it the old-fashioned way
}
else {
chomp;
print "no match for [$_]\n";
}
}
Output:
[something2]
[something3]
[something4]
[something 5]
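The question asks for the second fields in an array rather than printed. A minimal sketch of that follows, reusing a condensed copy of the pattern above so the snippet runs on its own. The name @result and the decision to silently skip lines that do not match are assumptions, not part of the original code.

```perl
#! /usr/bin/env perl
use strict;
use warnings;
use 5.010;  # named captures, (?&NAME), and say need perl 5.10 or later

# Condensed copy of the pattern above (same SEP and FIELD definitions).
my $second_trimmed_field_pattern = qr/
  ^ (?&FIELD) (?&SEP) (?<f2> (?&FIELD))
  (?(DEFINE)
    (?<SEP>   \s* ,)
    (?<FIELD> (?: (?! (?&SEP) | $) .)+ )
  )
/x;

my @source = (
  "something,something2,third",
  "something,something3 ,third",
  "something,something4",
  "something,something 5",
);

# Keep the named capture from each matching line. Lines that do not
# match contribute nothing to @result (an assumption; adjust as needed).
my @result = map { /$second_trimmed_field_pattern/ ? $+{f2} : () } @source;

say "[$_]" for @result;
```

Returning the empty list for non-matching lines keeps @result free of undef entries; if positions must line up with @source, returning undef instead would be the alternative.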
You can express it similarly for older perls. Below, I confine the pieces to the lexical scope of a sub to show that they all work together as a unit.
sub make_second_trimmed_field_pattern {
my $sep = qr/
# The separator is a comma preceded by optional whitespace.
# NOTE: the format uses simple comma separators, NOT full CSV, so
# we don't have to worry about processing escapes or quoted
# fields.
\s* ,
/x;
my $field = qr/
# A field stops matching as soon as it sees a separator
# or end-of-string, so it matches in similar fashion to
# a pattern with a non-greedy quantifier.
(?:
# the next character to be matched is not the
# beginning of a separator sequence or
# end-of-string
(?! $sep | $ )
# ... so consume it
.
)+ # ... as many times as possible
/x;
qr/ ^ $field $sep ($field) /x;
}
Use it as in:
my @source = ...; # same as above
my $second_trimmed_field_pattern = make_second_trimmed_field_pattern;
for (@source) {
if (/$second_trimmed_field_pattern/) {
print "[$1]\n";
}
else {
chomp;
print "no match for [$_]\n";
}
}
Output:
$ perl5.8.8 prog
[something2]
[something3]
[something4]
[something 5]
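To fill an array with the pre-5.10 version, one option is to rely on the fact that a match in list context returns its captured groups, and an empty list on failure. The sketch below assumes the pattern keeps exactly one capturing group, as the sub above does; the factory is condensed here only so the snippet is self-contained, and $pattern and @result are illustrative names.

```perl
#! /usr/bin/env perl
use strict;
use warnings;

# Condensed copy of the factory above, without the explanatory comments.
sub make_second_trimmed_field_pattern {
  my $sep   = qr/ \s* , /x;
  my $field = qr/ (?: (?! $sep | $ ) . )+ /x;
  qr/ ^ $field $sep ($field) /x;
}

my @source = (
  "something,something2,third",
  "something,something3 ,third",
  "something,something4",
  "something,something 5",
);

my $pattern = make_second_trimmed_field_pattern();

# In list context the match returns ($1) on success and () on failure,
# so non-matching lines simply drop out of @result.
my @result = map { /$pattern/ } @source;

print "[$_]\n" for @result;
```

This stays short because the pattern has a single capture; adding another capturing group would make each match contribute more than one element.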