Match all commas that are outside parentheses and square brackets in a Perl regex


I'm trying to use a regex to match every comma followed by a space (", ") that is outside any parentheses or square brackets, i.e. the comma must not be contained in parentheses or square brackets.

The target string is A, An(hi, world[hello, (hi , world) world]); This, These. In this case, it should match the first comma and the last comma (the ones between A and An, and between This and These).

That way I could split A, An(hi, world[hello, (hi , world) world]); This, These into three parts: "A", "An(hi, world[hello, (hi , world) world]); This", and "These", without leaving parentheses or brackets unbalanced.

To that end, it seems hard to use regex alone. Is there any other approach to this problem?

The regex I'm using: , (?![^()\[\]]*[\)\]])

But this expression also matches two extra commas (the second and the third), which should not be matched.

It does work on simpler strings, where it matches only the correct comma (the first one in each case): A, An(hi, world) and A, An[hi, world]

But when parentheses and brackets are nested inside each other, it fails.

More details in this link: https://regex101.com/r/g8DOh6/1


Solution

  • The problem here is in identifying "balanced" pairs, of parentheses/brackets in this case. This is a well-known problem, for which there are libraries. They can find the top-level matching pairs, (...) or [...], with everything inside them, along with everything outside them -- then one processes the "outside" parts.

    One way, using Regexp::Common

    use warnings;
    use strict;
    use feature 'say';
    
    use Regexp::Common;
    
    my $str = shift // q{A, t(a,b(c,))u B, C, p(d,)q D,}; 
    
    my @all_parts = split /$RE{balanced}{-parens=>'()[]'}/, $str;
    
    my @no_paren_parts = grep { not /\(.*\) | \[.*\]/x } @all_parts;
    
    say for @no_paren_parts;
    

    This uses split's property to return the list with separators included when the regex in the separator pattern captures. The library regex captures so we get it all back -- the parts obtained by splitting the string by what regex matched but also the parts matched by the regex. The separators contain the paired delimiters while other terms cannot, by construction, so I filter them out by that. Prints

    A, t
    u B, C, p
    q D,
    

    The paren/bracket terms are gone, but how the string is split is otherwise a bit arbitrary.

    The above is somewhat "generic," using the library merely to extract the balanced pairs ()/[], along with all other parts of the string. Or, we can remove those patterns from the string

    $str =~ s/$RE{balanced}{-parens=>'()[]'}//g;
    

    which leaves

    A, tu B, C, pq D,
    

    Now one can simply split by commas

    my @terms = split /\s*,\s*/, $str;
    say for @terms;
    

    which prints

    A
    tu B
    C
    pq D
    

    This is the desired result in this case, as clarified in comments.
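
    For the string from the question, where the bracketed parts have to be kept rather than removed, another option is to skip the regex and track nesting depth by hand. The following is only a minimal sketch of that idea, not part of the approach above: split_top_level is a hypothetical helper name, it assumes the brackets are balanced, and it treats ( and [ interchangeably when counting depth.

```perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not a library function):
# split $str on ", " wherever the bracket nesting depth is zero.
# Assumes balanced input; '(' and '[' are counted the same way.
sub split_top_level {
    my ($str) = @_;
    my @parts;
    my $depth = 0;    # current nesting level of () and []
    my $start = 0;    # offset where the current part begins
    my $len   = length $str;
    my $i     = 0;

    while ($i < $len) {
        my $ch = substr($str, $i, 1);
        if ($ch eq '(' or $ch eq '[') {
            $depth++;
        }
        elsif ($ch eq ')' or $ch eq ']') {
            $depth-- if $depth > 0;
        }
        elsif ($depth == 0 and substr($str, $i, 2) eq ', ') {
            # Top-level separator: close the current part, skip ", "
            push @parts, substr($str, $start, $i - $start);
            $i += 2;
            $start = $i;
            next;
        }
        $i++;
    }
    push @parts, substr($str, $start);    # whatever follows the last separator
    return @parts;
}

my $str   = 'A, An(hi, world[hello, (hi , world) world]); This, These';
my @parts = split_top_level($str);
print "$_\n" for @parts;
```

    On the question's string this yields three parts, with the whole An(...); This stretch kept intact. A stricter version would keep a stack of opening delimiters so that mismatched pairs such as (] are detected.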

    Another notable library, in many ways more fundamental, is the core Text::Balanced. See Shawn's answer here, and this post, this one, and this one for examples.
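
    As a rough illustration of what Text::Balanced offers (a sketch written for this page, not taken from the linked posts), its extract_bracketed function can pull out the first top-level bracketed group. The prefix pattern passed as the third argument is my own choice here: it skips everything up to the first opening delimiter.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $str = 'A, An(hi, world[hello, (hi , world) world]); This, These';

# Second argument: the delimiters to balance.
# Third argument: a prefix pattern to skip before the opening delimiter;
# here, any run of characters that are neither '(' nor '['.
# In list context the call returns (extracted, remainder, skipped prefix).
my ($group, $rest, $before) = extract_bracketed($str, '()[]', '[^(\[]*');

print "before: $before\n";
print "group:  $group\n";
print "rest:   $rest\n";
```

    This finds only the first group; to process a string with several top-level groups one would call it again on the remainder.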


    An example of split returning the separators when its pattern captures. With

    my $str = q(it, is; surely);
    
    my @terms = split /[,;]/, $str;
    

    one gets the elements "it", " is", " surely" in the array @terms, while with

    my @terms = split /([,;])/, $str;
    

    we get in @terms all of: "it", ",", " is", ";", " surely"


    Also by construction, @all_parts holds what the regex matched at odd indices. So for all other parts we can fetch the elements at even indices

    my @other_than_matched_parts = @all_parts[ grep { not $_ & 1 } 0..$#all_parts ];
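
    A small self-contained check of that index layout, reusing the string from the split example (the variable names here are only illustrative):

```perl
use strict;
use warnings;

my $str = q(it, is; surely);

# The capturing group makes split return the separators as well
my @all = split /([,;])/, $str;

# Separators sit at odd indices, the pieces between them at even indices
my @separators = @all[ grep {     $_ & 1 } 0..$#all ];
my @others     = @all[ grep { not $_ & 1 } 0..$#all ];

print "separators: ", join('|', @separators), "\n";
print "others:     ", join('|', @others), "\n";
```

    This layout holds when the pattern has a single capturing group per match; with more groups, each separator contributes more than one element.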