
Regex in Perl not matching a Unicode String


I'm trying to match a Unicode string using a Perl regex. The string seems to arrive at my module unscathed and properly encoded, since printing it to STDOUT gives: "Asuncion, Distrito Capital de Paraguay, Región Oriental, Paraguay."

However, it won't match the regex. Oddly, if I copy the script's output into a second variable and test that, it does match the same regex:

use v5.12;
use utf8;

my $placeString = $main::FORM{'placeString'}; # Coming from a different module.
say STDOUT $placeString;

utf8::upgrade($placeString); # Using this or removing this seems to make no difference.

# If I manually copy the output of STDOUT (above) in BASH and set the string, it works:
my $placeString2 = "Asuncion, Distrito Capital de Paraguay, Región Oriental, Paraguay";

if ($placeString =~ /^([\w\s\,\.\-\']+)$/) {
    # This is evaluated as false.
    say STDERR "Accepted placename.";
}


if ($placeString2 =~ /^([\w\s\,\.\-\']+)$/) {
    # This is evaluated as true.
    say STDERR "Accepted placename.";
}

Solution

  • However, it won't match in Regex.

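    From the comments it becomes clear that the value is a UTF-8 encoded string, that is, a sequence of bytes rather than decoded characters. It prints correctly because the bytes pass through to the terminal unchanged, but to the regex the "ó" is two separate bytes (0xC3 0xB3), and the second one is not a word character, so \w fails on it. utf8::upgrade only changes how Perl stores the string internally, not what it contains, which is why it makes no difference. The literal in $placeString2 works because "use utf8" makes Perl decode the source file, so that string already holds characters. You need to decode the value before doing the match: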

    use Encode qw(decode_utf8);
    $placeString = decode_utf8($placeString);
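
The difference can be reproduced without the web form. The following is a minimal sketch, not the asker's actual setup: $raw is a hypothetical stand-in for $main::FORM{'placeString'}, built by encoding the literal to UTF-8 bytes on the assumption that this is how the form layer delivers the value. It uses decode('UTF-8', ...) from Encode, which does the same job as decode_utf8 here.

```perl
use v5.12;
use utf8;                         # this source file is saved as UTF-8
use Encode qw(decode encode);

# Hypothetical stand-in for the form value: raw UTF-8 bytes, not characters.
my $raw = encode('UTF-8',
    "Asuncion, Distrito Capital de Paraguay, Región Oriental, Paraguay");

# Same character class as in the question.
my $pattern = qr/^([\w\s\,\.\-\']+)$/;

# Undecoded bytes: "ó" is 0xC3 0xB3, and 0xB3 is not matched by \w.
my $raw_matches = ($raw =~ $pattern) ? 1 : 0;

# utf8::upgrade changes only the internal storage; the string still
# holds the same two code points, so the result does not change.
my $upgraded = $raw;
utf8::upgrade($upgraded);
my $upgraded_matches = ($upgraded =~ $pattern) ? 1 : 0;

# Decoded characters: "ó" is the single code point U+00F3, which \w matches.
my $decoded = decode('UTF-8', $raw);
my $decoded_matches = ($decoded =~ $pattern) ? 1 : 0;

binmode(STDOUT, ':encoding(UTF-8)');   # encode characters again on output
say "raw bytes match: $raw_matches";
say "upgraded match:  $upgraded_matches";
say "decoded match:   $decoded_matches";
say $decoded;
```

Once input is decoded, output has to be encoded again, which is what the binmode call does here; printing decoded characters to a handle without an encoding layer can produce "Wide character" warnings or mangled text.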