
How to split a string containing a hyphen in Perl


I have a string referred to as ($date) that I am trying to split into two parts using Perl.

$date= (June 25, 2018–July 1, 2018)

From what I have read, the proper way to split this string into the two separate dates is to create a new array, call Perl's split() function with the hyphen as the delimiter, and then assign the array elements to my StartDate/EndDate variables, like this:

@dates = split(/-/, $date);
$StartDate = @dates[0];
$EndDate = @dates[1];

print "Effective Date: ($date)\n";
print "($StartDate)";
print "\n";
print "($EndDate)";

However, this does not work as I expected.

Please keep in mind that the code above is only a small section of the source code.

Current Output (Incorrect)

Effective Date: (June 25, 2018–July 1, 2018)
(June 25, 2018–July 1, 2018)
()

Expected Output (Correct)

Effective Date: (June 25, 2018–July 1, 2018)
(June 25, 2018)
(July 1, 2018)

Looking for any advice on how to achieve my goal.


Solution

  • The problem here is that you're trying to split on - (U+002D HYPHEN-MINUS), but your string contains – (U+2013 EN DASH).
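
One way to confirm which character a string really contains is to print the code points of its non-ASCII characters. This is only a minimal sketch: it assumes the string is already decoded text (here it comes from a literal under use utf8 in a file saved as UTF-8), and the helper name codepoints is made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # the string literal below contains a non-ASCII character

# Hypothetical helper: return every non-ASCII character of a string as U+XXXX.
sub codepoints {
    my ($str) = @_;
    return map  { sprintf 'U+%04X', ord($_) }
           grep { ord($_) > 0x7F }
           split //, $str;
}

my $date = 'June 25, 2018–July 1, 2018';
print join(' ', codepoints($date)), "\n";    # U+2013
```

If this prints U+2013 rather than nothing, the separator is not the ASCII hyphen that /-/ looks for.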

    There are a couple of ways you can specify this character in a regex:

    use utf8;
    ...
    my ($StartDate, $EndDate) = split /–/, $date;
    

    use utf8 tells perl that your source code is in UTF-8, so you can use Unicode characters literally.

    my ($StartDate, $EndDate) = split /\x{2013}/, $date;
    

    Or you can use a hex character code.

    my ($StartDate, $EndDate) = split /\N{EN DASH}/, $date;
    

    Or a named character reference.
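
All three spellings match a character, not a byte sequence, so they only work when $date holds decoded text. The sketch below assumes the date arrives as raw UTF-8 bytes (for example, read from a file without an encoding layer). The variable $raw and its contents are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: the date range as undecoded UTF-8 bytes.
# "\xE2\x80\x93" is the UTF-8 encoding of U+2013 EN DASH.
my $raw = "June 25, 2018\xE2\x80\x93July 1, 2018";

# Turn the bytes into characters so that /\x{2013}/ sees a single character.
my $date = decode('UTF-8', $raw);

my ($StartDate, $EndDate) = split /\x{2013}/, $date;

print "($StartDate)\n";    # (June 25, 2018)
print "($EndDate)\n";      # (July 1, 2018)
```

Without the decode step the pattern does not match the three separate bytes, so the whole string ends up in $StartDate and $EndDate stays undefined.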

    If you don't necessarily want to split on EN DASH but any dash-like character, you can use a character class based on the "Dash" property:

    my ($StartDate, $EndDate) = split /\p{Dash}/, $date;
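
To illustrate, the same pattern can handle a plain hyphen, an en dash, and an em dash. This is a sketch with made-up sample strings; the surrounding \s* (to trim spaces around the dash) and the limit of 2 are additions that are not part of the line above.

```perl
use strict;
use warnings;
use utf8;    # one sample below contains a literal en dash

my @samples = (
    'June 25, 2018-July 1, 2018',             # U+002D HYPHEN-MINUS
    'June 25, 2018–July 1, 2018',             # U+2013 EN DASH
    "June 25, 2018 \x{2014} July 1, 2018",    # U+2014 EM DASH with spaces
);

my @results;
for my $date (@samples) {
    # Limit of 2: split at the first dash only; later dashes stay in $end.
    my ($start, $end) = split /\s*\p{Dash}\s*/, $date, 2;
    push @results, "$start|$end";
    print "($start) ($end)\n";    # (June 25, 2018) (July 1, 2018)
}
```

Because \p{Dash} also matches the ASCII hyphen, dates written like 2018-06-25 would be cut at their first internal hyphen. For such input, a specific character such as \x{2013} is the safer choice.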
    

    Note that @dates[0] will trigger a warning (if use warnings is enabled, which it should be) because a single element of an array @foo is spelled $foo[0] in Perl. The syntax @array[ LIST ] is used for array slices, i.e. extracting multiple elements by their indices.
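
The difference between the two forms can be sketched as follows. The sample array is made up for illustration; the warning perl emits for the one-element slice reads "Scalar value @dates[0] better written as $dates[0]".

```perl
use strict;
use warnings;

my @dates = ('June 25, 2018', 'July 1, 2018');

# Single element: scalar sigil $ with one index.
my $StartDate = $dates[0];

# Slice: array sigil @ with a list of indices, returning a list.
my ($first, $second) = @dates[0, 1];

print "($StartDate)\n";          # (June 25, 2018)
print "($first) ($second)\n";    # (June 25, 2018) (July 1, 2018)
```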