Search code examples
regexperl

parse string by comma, but ignore comma's within quotes and parentheses


I'm trying to pull out all comma separated entries from strings but I'm having issues with some values that have comma's in quoted data, or within parentheses. I've tried to use Parse::CSV and Text::ParseWords but it doesn't catch the comma separated ones that are in between parentheses. I also have a piece of code which does the opposite, it finds the comma between the parentheses but misses the one between the quotes. I took over the code for this so I'm not exactly sure how to add to the regex block. I'm attaching the code, plus the response. There are 2 examples, one with the Text::ParseWords example and the other with the regex example. Any help is appreciated.

my $string1 = "10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT";

my $string2 = "10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)";

for (my $i=0; $i<2; $i++) {
  my $string;
  $string = $string1 if ($i == 0);
  $string = $string2 if ($i == 1);
  my @fields1 = ();
  my @fields2 = ();

  @fields1 = parse_line(q{,}, 0, $string);
  push @fields2, $1 while $string =~ /
                      ((?:
                       [^(),]+ |
                       ( \(
                       (?: [^()]+ | (?2) )*
                       \) )
                       )*)
                       (?: ,\s* | $)
                       /xg;

  print "\n\nProcessing String - $string\n";

  my $cnt1 = 0;
  print "\n";
  foreach my $field (@fields1) {
    $field =~ s/^\s+|\s+$//g;
    print "\nExample 1:  field $cnt1 is $field\n";
    $cnt1++;
  }

  my $cnt2 = 0;
  print "\n";
  foreach my $field (@fields2) {
    $field =~ s/^\s+|\s+$//g;
    print "\nExample 2:  field $cnt2 is $field\n";
    $cnt2++;
  }
}

and here is the response

Processing String - 10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT


Example 1:  field 0 is 10507

Example 1:  field 1 is STAGE

Example 1:  field 2 is SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM

Example 1:  field 3 is EXPORT_LOG_MSG

Example 1:  field 4 is EXPORT_LOG_MSG_COUNT


Example 2:  field 0 is 10507

Example 2:  field 1 is STAGE

Example 2:  field 2 is 'SELECT CNTC_KEY

Example 2:  field 3 is CUST_FST_NME

Example 2:  field 4 is CUST_LST_NM'

Example 2:  field 5 is EXPORT_LOG_MSG

Example 2:  field 6 is EXPORT_LOG_MSG_COUNT

Example 2:  field 7 is


Processing String - 10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)


Example 1:  field 0 is 10507

Example 1:  field 1 is STAGE

Example 1:  field 2 is SELECT,PRTNR_CNTC_KEY

Example 1:  field 3 is SQL_EXPORT_COUNT

Example 1:  field 4 is EXPORT_LOG_MSG

Example 1:  field 5 is NAME(FIRST

Example 1:  field 6 is LAST)


Example 2:  field 0 is 10507

Example 2:  field 1 is STAGE

Example 2:  field 2 is 'SELECT

Example 2:  field 3 is PRTNR_CNTC_KEY'

Example 2:  field 4 is SQL_EXPORT_COUNT

Example 2:  field 5 is EXPORT_LOG_MSG

Example 2:  field 6 is NAME(FIRST,LAST)

Example 2:  field 7 is

I'm editing this now as to some confusion about what I'm looking to accomplish. I'll take the 2nd string as an example.

10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)

field1 should be 10507
field2 should be STAGE
field3 should be 'SELECT,PRTNR_CNTC_KEY'
field4 should be SQL_EXPORT_COUNT
field5 should be EXPORT_LOG_MSG
field6 should be NAME(FIRST,LAST)

so any comma that is within () or within '' should NOT be separated. I might even run into other character conditions such as [] or double quotes as well.


Solution

  • You have too strings, which I've simplified to fit nicely on the screen since it's the last field that is giving you a problem:

    10507, STAGE, 'SELECT CNTC_KEY,CUST_LST_NM', EXPORT_LOG_MSG_COUNT
    
    10507, STAGE, 'SELECT,PRTNR_CNTC_KEY', NAME(FIRST,LAST)
    

    It's easy to deal with the first example because the single quotes delimit the extent of that third field. The second one is tougher because that field has nothing to delimit it. There's nothing to tell any general parser that NAME(FIRST,LAST) should be one thing.

    I don't think there's a way to make that regex work with warping people's minds. It looks like a field like NAME(FIRST,LAST) can show up in any position. Maybe there are lots of other special cases.

    For these things, I hope the constraints are simple enough that I can write a parser (and if you look inside Text::CSV_XS, you'll see the same thing going on). Basically, you walk along the string and remember what you are doing. Depending on what you see next, you do something and maybe affect some state. You keep doing that until you are done.

    This subroutine uses Perl's global matching in scalar context with the /gc/ flags, which is a powerful feature once you get comfortable with it (maybe I cover this in Mastering Perl. The /g is the global bit, and being in the conditional provides the scalar context. The /c is the trick: if the match fails, it tells Perl not to reset the match position, but to start the next match on the same string at the current position. The \G anchor to the starting position of the match. See the pos docs.

    sub parse_line {
        local $_ = shift;
        my @parts = ();
    
        my $found_field;
        while( 1 ) {
            if( /\G $/gcx ) {  # at the end
                push @parts, '' unless $found_field;
                last;
                }
            elsif( /\G \s* , \s* /gcxp ) { # field separator, with padding
                push @parts, '' unless $found_field;
                $found_field = 0;
                }
            elsif( /\G (?<quote>['"]) (?<field> .*? ) (?<!\\) \g{quote} /gcx ) { # quoted field
                push @parts, $+{field};
                $found_field = 1;
                }
            elsif( /\G \(  (?<field> .*? )  (?<!\\) \) /gcxp ) { # partial field with parens
                if( $found_field ) { $parts[-1] .= ${^MATCH} }
                else               { push @parts, ${^MATCH} }
                $found_field = 1;
                }
            elsif( /\G [^,(]* /gcxp ) { # partial field, maybe more coming
                push @parts, ${^MATCH};
                $found_field = 1;
                }
            else {
                Carp::carp "Parsing ended abruptly at pos " . pos;
                last;
                }
            }
    
        return \@parts
        }
    

    In that code, you insert other branches if there are other cases that you want to handle. For example, if you're playing with Excel and it's crazy way of escaping double quotes ("field "" <--- embedded"), you have to handle that in some other way. That's what makes this problem so hard in general. If you can constrain it locally, it gets easier.

    And, I think I've probably done a few things that hard way in that, but I just work up so you'll have to live with that :)

    Here's a test program that covers the corner cases I could think of off the top of my head.

    use v5.26;
    use Carp;
    use Data::Dumper qw(Dumper);
    use Test::More;
    
    my @table = (
        [
            'embedded commas in single quotes',
            q(10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT),
            [
              '10507',
              'STAGE',
              'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',
              'EXPORT_LOG_MSG',
              'EXPORT_LOG_MSG_COUNT'
            ],
        ],
        [
            'embedded commas in double quotes',
            q(10507, STAGE, "SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM",EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT),
            [
              '10507',
              'STAGE',
              'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',
              'EXPORT_LOG_MSG',
              'EXPORT_LOG_MSG_COUNT'
            ],
        ],
        [
            'embedded commas in parens, with prefix, last field',
            q(10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)),
            [
              '10507',
              'STAGE',
              'SELECT,PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG',
              'NAME(FIRST,LAST)'
            ],
        ],
        [
            'embedded commas in parens, with prefix, second field',
            q(10507, NAME(FIRST,LAST), STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
            [
              '10507',
              'NAME(FIRST,LAST)',
              'STAGE',
              'SELECT,PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'embedded opening paren in quoted field',
            q(10507, 'NAME(FIRST', LAST), STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
            [
              '10507',
              'NAME(FIRST',
              'LAST)',
              'STAGE',
              'SELECT,PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'escaped quote in quoted field',
            q|10507, 'NAME(FIRST', LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
            [
              '10507',
              'NAME(FIRST',
              'LAST)',
              'STAGE',
              'SELECT,\\\'PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'escaped closing paren in paren field',
            q|10507, NAME(FIRST, \\) LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
            [
              '10507',
              'NAME(FIRST, \\) LAST)',
              'STAGE',
              'SELECT,\\\'PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'empty parens',
            q|10507, NAME(), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
            [
              '10507',
              'NAME()',
              'STAGE',
              'SELECT,\\\'PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'empty parens only',
            q|10507, (), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
            [
              '10507',
              '()',
              'STAGE',
              'SELECT,\\\'PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'field starts with paren',
            q|10507, (FIRST, LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
            [
              '10507',
              '(FIRST, LAST)',
              'STAGE',
              'SELECT,\\\'PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'field is empty',
            q(10507, , STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
            [
              '10507',
              '',
              'STAGE',
              'SELECT,\\\'PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'first field is empty',
            q(, STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
            [
              '',
              'STAGE',
              'SELECT,\\\'PRTNR_CNTC_KEY',
              'SQL_EXPORT_COUNT',
              'EXPORT_LOG_MSG'
            ],
        ],
        [
            'all empty',
            q(, , ,, ,),
            [
              '',
              '',
              '',
              '',
              '',
              '',
            ],
        ],
    
        );
    
    foreach my $row ( @table ) {
        my( $label, $expected ) = $row->@[0,2];
        my $expected = $row->[2];
    
        my $parts = parse_line($row->[1]);
    
        is_deeply $parts, $expected, $label or do {
            diag( $row->[1] );
            diag( Dumper( $parts ) )
            };
        }
    
    sub parse_line {
        local $_ = shift;
        my @parts = ();
    
        my $found_field;
        while( 1 ) {
            if( /\G $/gcx ) {  # at the end
                push @parts, '' unless $found_field;
                last;
                }
            elsif( /\G \s* , \s* /gcxp ) { # field separator, with padding
                push @parts, '' unless $found_field;
                $found_field = 0;
                }
            elsif( /\G (?<quote>['"]) (?<field> .*? ) (?<!\\) \g{quote} /gcx ) { # quoted field
                push @parts, $+{field};
                $found_field = 1;
                }
            elsif( /\G \(  (?<field> .*? )  (?<!\\) \) /gcxp ) { # partial field with parens
                if( $found_field ) { $parts[-1] .= ${^MATCH} }
                else               { push @parts, ${^MATCH} }
                $found_field = 1;
                }
            elsif( /\G [^,(]* /gcxp ) { # partial field, maybe more coming
                push @parts, ${^MATCH};
                $found_field = 1;
                }
            else {
                Carp::carp "Parsing ended abruptly at pos " . pos;
                last;
                }
            }
    
        return \@parts
        }
    
    done_testing;