I'm trying to match, using regex, all commas followed by a space (", ") that are outside any parentheses or square brackets, i.e. the comma should not be contained in parentheses or square brackets.
The target string is
A, An(hi, world[hello, (hi , world) world]); This, These
In this case, it should match the first comma and the last comma (the one between A and An, and the one between This and These).
So I could split A, An(hi, world[hello, (hi , world) world]); This, These into
A
An(hi, world[hello, (hi , world) world]); This
These
without leaving parens/brackets unbalanced as a result.
To that end, it seems hard to use regex alone. Is there any other approach to this problem?
The regex expression I'm using:
, (?![^()\[\]]*[\)\]])
But this expression also matches two extra commas (the second and the third), which shouldn't be matched.
When matched against simpler strings such as A, An(hi, world) and A, An[hi, world], it does match the right comma (the first one in each case).
But when parentheses and brackets are nested inside each other, it breaks down.
More details in this link: https://regex101.com/r/g8DOh6/1
The problem here is in identifying "balanced" pairs, of parentheses/brackets in this case. This is a well-recognized problem, for which there are libraries. They can find the top-level matching pairs, (...) / [...], with all that's inside, and all else outside parens; then process the "else."
One way, using Regexp::Common
use warnings;
use strict;
use feature 'say';
use Regexp::Common;
my $str = shift // q{A, t(a,b(c,))u B, C, p(d,)q D,};
my @all_parts = split /$RE{balanced}{-parens=>'()[]'}/, $str;
my @no_paren_parts = grep { not /\(.*\) | \[.*\]/x } @all_parts;
say for @no_paren_parts;
This uses split's property to return the list with separators included when the regex in the separator pattern captures.† The library regex captures so we get it all back -- the parts obtained by splitting the string by what regex matched but also the parts matched by the regex. The separators contain the paired delimiters while other terms cannot, by construction, so I filter them out by that.‡ Prints
A, t
u B, C, p
q D,
The paren/bracket terms are gone, but how the string is split is otherwise a bit arbitrary.
The above is somewhat "generic," using the library merely to extract the balanced pairs () / [], along with all other parts of the string. Or, we can remove those patterns from the string
$str =~ s/$RE{balanced}{-parens=>'()[]'}//g;
which leaves
A, tu B, C, pq D,
Now one can simply split by commas
my @terms = split /\s*,\s*/, $str;
say for @terms;
which prints
A
tu B
C
pq D
This is the desired result in this case, as clarified in comments.
Another notable library, in many ways more fundamental, is the core Text::Balanced. See Shawn's answer here, and for example this post and this one and this one for examples.
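For comparison, the same idea (skip everything nested, act only on the top level) can be sketched without any library by scanning the string once and tracking nesting depth. This is a minimal sketch, not part of either library above: the name split_top_level is hypothetical, it assumes the brackets are balanced, and it does not check that a closer matches the type of its opener.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: split on ", " only where the nesting depth
# of () and [] is zero. Assumes balanced input.
sub split_top_level {
    my ($str) = @_;
    my ($depth, $start, @parts) = (0, 0);
    my $len = length $str;
    for (my $i = 0; $i < $len; $i++) {
        my $ch = substr($str, $i, 1);
        if ($ch eq '(' or $ch eq '[') {
            $depth++;                      # entering a nested group
        }
        elsif ($ch eq ')' or $ch eq ']') {
            $depth-- if $depth > 0;        # leaving a nested group
        }
        elsif ($depth == 0 and substr($str, $i, 2) eq ', ') {
            push @parts, substr($str, $start, $i - $start);
            $start = $i + 2;               # next part begins after ", "
            $i++;                          # skip the space after the comma
        }
    }
    push @parts, substr($str, $start);     # whatever follows the last split
    return @parts;
}

say for split_top_level('A, An(hi, world[hello, (hi , world) world]); This, These');
```

On the string from the question this yields the three pieces A, then An(hi, world[hello, (hi , world) world]); This, then These, so the parens/brackets stay intact inside the middle piece.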
† An example. With
my $str = q(it, is; surely);
my @terms = split /[,;]/, $str;
one gets it, is, surely in the array @terms, while with
my @terms = split /([,;])/, $str;
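we get in @terms all of: ('it', ',', ' is', ';', ' surely'), that is, the separators as well (note that the whitespace after each separator stays with the following term).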
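‡ Also by construction, when the pattern has a single capturing group the returned list alternates: counting from 0, what the regex matched sits at odd indices, and the parts between the matches at even indices. So for all other parts we can fetch the elements at even indices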
my @other_than_matched_parts = @all_parts[ grep { not $_ & 1 } 0..$#all_parts ];
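The index arithmetic can be checked on the small string from the first footnote. This is only an illustration with a single capturing group in the split pattern; the variable names @between and @separators are made up for the example.

```perl
use strict;
use warnings;
use feature 'say';

my $str = q(it, is; surely);

# The capturing group makes split return the separators as well
my @all_parts = split /([,;])/, $str;

# Even indices (0, 2, 4): the text between separators
my @between    = @all_parts[ grep { not $_ & 1 } 0..$#all_parts ];
# Odd indices (1, 3): what the pattern matched
my @separators = @all_parts[ grep { $_ & 1 } 0..$#all_parts ];

say join '|', @between;      # it| is| surely
say join '|', @separators;   # ,|;
```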