Perl regex substitutions with Unicode characters don't work. What am I missing?


I'm trying to 'fix' some files that contain unexpected Unicode characters, using Perl's regex \N{UNICODE NAME} construct. For some reason I don't fully understand, nothing happens, and no error message is printed either. Here is a simple test example.

[2007]$ read ZZ < test.txt && unum "${ZZ}"
   Octal  Decimal      Hex        HTML    Character   Unicode
    0101       65     0x41       &#65;    "A"         LATIN CAPITAL LETTER A
     040       32     0x20       &#32;    " "         SPACE, SP
    0341      225     0xE1    &aacute;    "á"         LATIN SMALL LETTER A WITH ACUTE
     040       32     0x20       &#32;    " "         SPACE, SP
    0334      220     0xDC      &Uuml;    "Ü"         LATIN CAPITAL LETTER U WITH DIAERESIS
     040       32     0x20       &#32;    " "         SPACE, SP
    0321      209     0xD1    &Ntilde;    "Ñ"         LATIN CAPITAL LETTER N WITH TILDE
     040       32     0x20       &#32;    " "         SPACE, SP
     040       32     0x20       &#32;    " "         SPACE, SP
  062745    26085   0x65E5    &#26085;    "日"         CJK UNIFIED IDEOGRAPH-#65E5, IRGKangXi=0489.010, RSKangXi=72.0, Def{sun; day; daytime}
  063454    26412   0x672C    &#26412;    "本"         CJK UNIFIED IDEOGRAPH-#672C, IRGKangXi=0509.070, RSKangXi=75.1, Def{root, origin, source; basis}
 0105236    35486   0x8A9E    &#35486;    "語"         CJK UNIFIED IDEOGRAPH-#8A9E, IRGKangXi=1163.080, RSKangXi=149.7, Def{language, words; saying, expression}
     040       32     0x20       &#32;    " "         SPACE, SP
     061       49     0x31       &#49;    "1"         DIGIT ONE
     040       32     0x20       &#32;    " "         SPACE, SP
     040       32     0x20       &#32;    " "         SPACE, SP
 0177421    65297   0xFF11    &#65297;    "1"         FULLWIDTH DIGIT ONE
     040       32     0x20       &#32;    " "         SPACE, SP
     040       32     0x20       &#32;    " "         SPACE, SP
     057       47     0x2F       &sol;    "/"         SOLIDUS
     040       32     0x20       &#32;    " "         SPACE, SP
     040       32     0x20       &#32;    " "         SPACE, SP
    0137       95     0x5F &lowbar;,&UnderBar;    "_"         LOW LINE

Now, when I try to replace one character as a test using a Perl one-liner, e.g.

[2008]$ perl -p -e 's/\N{LATIN CAPITAL LETTER U WITH DIAERESIS}+/X/gu;' test.txt
A á Ü Ñ  日本語 1  1  /  _

There's no error, but no substitution either. I also tried,

[2013]$ perl -e 'BEGIN { use charnames q{:full}; }' -p -e 's/\N{LATIN CAPITAL LETTER U WITH DIAERESIS}+/X/gu;' test.txt
A á Ü Ñ  日本語 1  1  /  _

with no change. What am I missing? The documentation seems to imply that this should work.

If I make a direct substitution it works as expected,

[2015]$ perl -p -e 's/日+/X/gu;' test.txt
A á Ü Ñ  X本語 1  1  /  _

Solution

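  • You have to tell perl that the input is UTF-8, and that standard output is UTF-8 too (the latter can be skipped, but then you'll get a warning). Without that, perl reads the file as raw bytes, so the two UTF-8 bytes of Ü (0xC3 0x9C) never match the single code point U+00DC that \N{LATIN CAPITAL LETTER U WITH DIAERESIS} stands for.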

    In a one-liner, the -C command-line option controls what is treated as UTF-8. D makes UTF-8 the default encoding for PerlIO handles that perl opens (for both reading and writing; there are separate options for only reading or only writing, see perlrun for details), and S says that all standard streams (input, output and error) are UTF-8 encoded.

    So...

    $ perl -CDS -wpe 's/\N{LATIN CAPITAL LETTER U WITH DIAERESIS}+/X/gu;' test.txt
    A á X Ñ  日本語 1  1  /  _