I'm trying to pull out all comma separated entries from strings but I'm having issues with some values that have comma's in quoted data, or within parentheses. I've tried to use Parse::CSV and Text::ParseWords but it doesn't catch the comma separated ones that are in between parentheses. I also have a piece of code which does the opposite, it finds the comma between the parentheses but misses the one between the quotes. I took over the code for this so I'm not exactly sure how to add to the regex block. I'm attaching the code, plus the response. There are 2 examples, one with the Text::ParseWords example and the other with the regex example. Any help is appreciated.
my $string1 = "10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT";
my $string2 = "10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)";
for (my $i=0; $i<2; $i++) {
my $string;
$string = $string1 if ($i == 0);
$string = $string2 if ($i == 1);
my @fields1 = ();
my @fields2 = ();
@fields1 = parse_line(q{,}, 0, $string);
push @fields2, $1 while $string =~ /
((?:
[^(),]+ |
( \(
(?: [^()]+ | (?2) )*
\) )
)*)
(?: ,\s* | $)
/xg;
print "\n\nProcessing String - $string\n";
my $cnt1 = 0;
print "\n";
foreach my $field (@fields1) {
$field =~ s/^\s+|\s+$//g;
print "\nExample 1: field $cnt1 is $field\n";
$cnt1++;
}
my $cnt2 = 0;
print "\n";
foreach my $field (@fields2) {
$field =~ s/^\s+|\s+$//g;
print "\nExample 2: field $cnt2 is $field\n";
$cnt2++;
}
}
and here is the response
Processing String - 10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT
Example 1: field 0 is 10507
Example 1: field 1 is STAGE
Example 1: field 2 is SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM
Example 1: field 3 is EXPORT_LOG_MSG
Example 1: field 4 is EXPORT_LOG_MSG_COUNT
Example 2: field 0 is 10507
Example 2: field 1 is STAGE
Example 2: field 2 is 'SELECT CNTC_KEY
Example 2: field 3 is CUST_FST_NME
Example 2: field 4 is CUST_LST_NM'
Example 2: field 5 is EXPORT_LOG_MSG
Example 2: field 6 is EXPORT_LOG_MSG_COUNT
Example 2: field 7 is
Processing String - 10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)
Example 1: field 0 is 10507
Example 1: field 1 is STAGE
Example 1: field 2 is SELECT,PRTNR_CNTC_KEY
Example 1: field 3 is SQL_EXPORT_COUNT
Example 1: field 4 is EXPORT_LOG_MSG
Example 1: field 5 is NAME(FIRST
Example 1: field 6 is LAST)
Example 2: field 0 is 10507
Example 2: field 1 is STAGE
Example 2: field 2 is 'SELECT
Example 2: field 3 is PRTNR_CNTC_KEY'
Example 2: field 4 is SQL_EXPORT_COUNT
Example 2: field 5 is EXPORT_LOG_MSG
Example 2: field 6 is NAME(FIRST,LAST)
Example 2: field 7 is
I'm editing this now as to some confusion about what I'm looking to accomplish. I'll take the 2nd string as an example.
10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)
field1 should be 10507
field2 should be STAGE
field3 should be 'SELECT,PRTNR_CNTC_KEY'
field4 should be SQL_EXPORT_COUNT
field5 should be EXPORT_LOG_MSG
field6 should be NAME(FIRST,LAST)
so any comma that is within () or within '' should NOT be separated. I might even run into other character conditions such as [] or double quotes as well.
You have too strings, which I've simplified to fit nicely on the screen since it's the last field that is giving you a problem:
10507, STAGE, 'SELECT CNTC_KEY,CUST_LST_NM', EXPORT_LOG_MSG_COUNT
10507, STAGE, 'SELECT,PRTNR_CNTC_KEY', NAME(FIRST,LAST)
It's easy to deal with the first example because the single quotes delimit the extent of that third field. The second one is tougher because that field has nothing to delimit it. There's nothing to tell any general parser that NAME(FIRST,LAST)
should be one thing.
I don't think there's a way to make that regex work with warping people's minds. It looks like a field like NAME(FIRST,LAST)
can show up in any position. Maybe there are lots of other special cases.
For these things, I hope the constraints are simple enough that I can write a parser (and if you look inside Text::CSV_XS, you'll see the same thing going on). Basically, you walk along the string and remember what you are doing. Depending on what you see next, you do something and maybe affect some state. You keep doing that until you are done.
This subroutine uses Perl's global matching in scalar context with the /gc/
flags, which is a powerful feature once you get comfortable with it (maybe I cover this in Mastering Perl. The /g
is the global bit, and being in the conditional provides the scalar context. The /c
is the trick: if the match fails, it tells Perl not to reset the match position, but to start the next match on the same string at the current position. The \G
anchor to the starting position of the match. See the pos docs.
sub parse_line {
local $_ = shift;
my @parts = ();
my $found_field;
while( 1 ) {
if( /\G $/gcx ) { # at the end
push @parts, '' unless $found_field;
last;
}
elsif( /\G \s* , \s* /gcxp ) { # field separator, with padding
push @parts, '' unless $found_field;
$found_field = 0;
}
elsif( /\G (?<quote>['"]) (?<field> .*? ) (?<!\\) \g{quote} /gcx ) { # quoted field
push @parts, $+{field};
$found_field = 1;
}
elsif( /\G \( (?<field> .*? ) (?<!\\) \) /gcxp ) { # partial field with parens
if( $found_field ) { $parts[-1] .= ${^MATCH} }
else { push @parts, ${^MATCH} }
$found_field = 1;
}
elsif( /\G [^,(]* /gcxp ) { # partial field, maybe more coming
push @parts, ${^MATCH};
$found_field = 1;
}
else {
Carp::carp "Parsing ended abruptly at pos " . pos;
last;
}
}
return \@parts
}
In that code, you insert other branches if there are other cases that you want to handle. For example, if you're playing with Excel and it's crazy way of escaping double quotes ("field "" <--- embedded"
), you have to handle that in some other way. That's what makes this problem so hard in general. If you can constrain it locally, it gets easier.
And, I think I've probably done a few things that hard way in that, but I just work up so you'll have to live with that :)
Here's a test program that covers the corner cases I could think of off the top of my head.
use v5.26;
use Carp;
use Data::Dumper qw(Dumper);
use Test::More;
my @table = (
[
'embedded commas in single quotes',
q(10507, STAGE, 'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT),
[
'10507',
'STAGE',
'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',
'EXPORT_LOG_MSG',
'EXPORT_LOG_MSG_COUNT'
],
],
[
'embedded commas in double quotes',
q(10507, STAGE, "SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM",EXPORT_LOG_MSG,EXPORT_LOG_MSG_COUNT),
[
'10507',
'STAGE',
'SELECT CNTC_KEY,CUST_FST_NME,CUST_LST_NM',
'EXPORT_LOG_MSG',
'EXPORT_LOG_MSG_COUNT'
],
],
[
'embedded commas in parens, with prefix, last field',
q(10507, STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG, NAME(FIRST,LAST)),
[
'10507',
'STAGE',
'SELECT,PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG',
'NAME(FIRST,LAST)'
],
],
[
'embedded commas in parens, with prefix, second field',
q(10507, NAME(FIRST,LAST), STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
[
'10507',
'NAME(FIRST,LAST)',
'STAGE',
'SELECT,PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'embedded opening paren in quoted field',
q(10507, 'NAME(FIRST', LAST), STAGE, 'SELECT,PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
[
'10507',
'NAME(FIRST',
'LAST)',
'STAGE',
'SELECT,PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'escaped quote in quoted field',
q|10507, 'NAME(FIRST', LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
[
'10507',
'NAME(FIRST',
'LAST)',
'STAGE',
'SELECT,\\\'PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'escaped closing paren in paren field',
q|10507, NAME(FIRST, \\) LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
[
'10507',
'NAME(FIRST, \\) LAST)',
'STAGE',
'SELECT,\\\'PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'empty parens',
q|10507, NAME(), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
[
'10507',
'NAME()',
'STAGE',
'SELECT,\\\'PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'empty parens only',
q|10507, (), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
[
'10507',
'()',
'STAGE',
'SELECT,\\\'PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'field starts with paren',
q|10507, (FIRST, LAST), STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG|,
[
'10507',
'(FIRST, LAST)',
'STAGE',
'SELECT,\\\'PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'field is empty',
q(10507, , STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
[
'10507',
'',
'STAGE',
'SELECT,\\\'PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'first field is empty',
q(, STAGE, 'SELECT,\'PRTNR_CNTC_KEY',SQL_EXPORT_COUNT, EXPORT_LOG_MSG),
[
'',
'STAGE',
'SELECT,\\\'PRTNR_CNTC_KEY',
'SQL_EXPORT_COUNT',
'EXPORT_LOG_MSG'
],
],
[
'all empty',
q(, , ,, ,),
[
'',
'',
'',
'',
'',
'',
],
],
);
foreach my $row ( @table ) {
my( $label, $expected ) = $row->@[0,2];
my $expected = $row->[2];
my $parts = parse_line($row->[1]);
is_deeply $parts, $expected, $label or do {
diag( $row->[1] );
diag( Dumper( $parts ) )
};
}
sub parse_line {
local $_ = shift;
my @parts = ();
my $found_field;
while( 1 ) {
if( /\G $/gcx ) { # at the end
push @parts, '' unless $found_field;
last;
}
elsif( /\G \s* , \s* /gcxp ) { # field separator, with padding
push @parts, '' unless $found_field;
$found_field = 0;
}
elsif( /\G (?<quote>['"]) (?<field> .*? ) (?<!\\) \g{quote} /gcx ) { # quoted field
push @parts, $+{field};
$found_field = 1;
}
elsif( /\G \( (?<field> .*? ) (?<!\\) \) /gcxp ) { # partial field with parens
if( $found_field ) { $parts[-1] .= ${^MATCH} }
else { push @parts, ${^MATCH} }
$found_field = 1;
}
elsif( /\G [^,(]* /gcxp ) { # partial field, maybe more coming
push @parts, ${^MATCH};
$found_field = 1;
}
else {
Carp::carp "Parsing ended abruptly at pos " . pos;
last;
}
}
return \@parts
}
done_testing;