regex, perl, readability

What is the most readable regex to extract a second word with no trailing spaces from a comma-separated string?


I have an array of strings of the form:

@source = (
     "something,something2,third"
    ,"something,something3   ,third"
    ,"something,something4"
    ,"something,something 5" # Note the space in the middle of the word
);

I need a regex which will extract the second of the comma separated words, BUT without the trailing spaces, putting those second words in an array.

@expected_result = ("something2","something3","something4","something 5");

What is the most readable way of achieving this?

I have 3 possibilities, none of which seems optimal readability-wise:

  1. Pure regex and then capture $1

    @result = map { (/[^,]+,([^,]*[^, ]) *(,|$)/ )[0] } @source;
    
  2. Split on commas (this is NOT a CSV so no parsing needed), then trim:

    @result = map { my @s = split /,/; $s[1] =~ s/ *$//; $s[1] } @source;
    
  3. Put split and trim into nested maps

    @result = map { s/ *$//; $_ } map { (split(","))[1] } @source;
    

Which one of these is better? Any other even more readable alternative I'm not thinking of?


Solution

  • Use named capture groups and give names to subpatterns with (DEFINE) to greatly improve readability.

    #! /usr/bin/env perl
    
    use strict;
    use warnings;
    
    use 5.10.0;  # for named capture buffer and (?&...)
    
    my $second_trimmed_field_pattern = qr/
      (?&FIRST_FIELD) (?&SEP) (?<f2> (?&SECOND_FIELD))
    
      (?(DEFINE)
        # The separator is a comma preceded by optional whitespace.
        # NOTE: the format uses simple comma separators, NOT full CSV, so
        # we don't have to worry about processing escapes or quoted
        # fields.
        (?<SEP>  \s* ,)
    
        # A field stops matching as soon as it sees a separator
        # or end-of-string, so it matches in similar fashion to
        # a pattern with a non-greedy quantifier.
        (?<FIELD> (?: (?! (?&SEP) | $) .)+ )
    
        # The first field is anchored at start-of-string.
        (?<FIRST_FIELD>  ^  (?&FIELD))
    
        # The second field looks like any other field. The name
        # captures our intent for its use in the main pattern.
        (?<SECOND_FIELD> (?&FIELD))
      )
    /x;
    

    In action:

    my @source = (
         "something,something2,third"
        ,"something,something3   ,third"
        ,"something,something4"
        ,"something,something 5" # Note the space in the middle of the word
    );
    
    for (@source) {
      if (/$second_trimmed_field_pattern/) {
        print "[$+{f2}]\n";
    
        #print "[$1]\n";  # or do it the old-fashioned way
      }
      else {
        chomp;
        print "no match for [$_]\n";
      }
    }
    

    Output:

    [something2]
    [something3]
    [something4]
    [something 5]
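
    The question asks for the second words in an array rather than printed. A minimal sketch of one way to collect them, assuming the same pattern as above in condensed form (the variable names $second_field and @result are illustrative, not part of the original answer):

```perl
#! /usr/bin/env perl

use strict;
use warnings;

use 5.010;  # named captures and (?&...) need perl 5.10 or later

# Condensed form of the pattern above. SEP and FIELD follow the
# answer's definitions; $second_field is an illustrative name.
my $second_field = qr/
  ^ (?&FIELD) (?&SEP) (?<f2> (?&FIELD) )

  (?(DEFINE)
    (?<SEP>   \s* , )
    (?<FIELD> (?: (?! (?&SEP) | $ ) . )+ )
  )
/x;

my @source = (
  "something,something2,third",
  "something,something3   ,third",
  "something,something4",
  "something,something 5",
);

# Keep the named capture for lines that match; a line that does not
# match contributes the empty list, so it is simply skipped.
my @result = map { /$second_field/ ? $+{f2} : () } @source;

print "[$_]\n" for @result;
```

    Returning the empty list for non-matching lines is a design choice in this sketch; if every input line must yield an element, replace () with undef or die with a message instead.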

    You can express it similarly for older perls. Below, I confine the pieces to the lexical scope of a sub to show that they all work together as a unit.

    sub make_second_trimmed_field_pattern {
      my $sep = qr/
        # The separator is a comma preceded by optional whitespace.
        # NOTE: the format uses simple comma separators, NOT full CSV, so
        # we don't have to worry about processing escapes or quoted
        # fields.
    
        \s* ,
      /x;
    
      my $field = qr/
        # A field stops matching as soon as it sees a separator
        # or end-of-string, so it matches in similar fashion to
        # a pattern with a non-greedy quantifier.
        (?:
            # the next character to be matched is not the
            # beginning of a separator sequence or
            # end-of-string
            (?! $sep | $ )
    
            # ... so consume it
            .
        )+  # ... as many times as possible
      /x;
    
      qr/ ^ $field $sep ($field) /x;
    }
    

    Use it as follows:

    my @source = ...;  # same as above
    
    my $second_trimmed_field_pattern = make_second_trimmed_field_pattern;
    for (@source) {
      if (/$second_trimmed_field_pattern/) {
        print "[$1]\n";
      }
      else {
        chomp;
        print "no match for [$_]\n";
      }
    }
    

    Output:

    $ perl5.8.8 prog
    [something2]
    [something3]
    [something4]
    [something 5]
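
    To see what the negative lookahead in the field pattern buys, here is a small sketch that contrasts it with a plain [^,]+ field on the input that has padding before the comma. It assumes the same $sep and $field definitions as the sub above, written at file scope for brevity; $naive and $trimmed are illustrative names.

```perl
use strict;
use warnings;

# Same building blocks as in make_second_trimmed_field_pattern.
my $sep   = qr/ \s* , /x;
my $field = qr/ (?: (?! $sep | $ ) . )+ /x;

my $line = "something,something3   ,third";

# Plain character class: runs right up to the comma, so the
# padding before the comma ends up inside the capture.
my ($naive) = $line =~ / ^ [^,]+ , ([^,]+) /x;

# Lookahead-guarded field: stops at the first position where
# "optional whitespace followed by a comma" begins.
my ($trimmed) = $line =~ / ^ $field $sep ($field) /x;

print "naive:   [$naive]\n";
print "trimmed: [$trimmed]\n";
```

    The first capture keeps the three trailing spaces, while the second stops at "something3" because the separator is allowed to absorb the whitespace.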