I'm trying to fix some files that contain unexpected Unicode characters, using Perl's regex \N{UNICODE NAME} construct. For some reason I don't fully understand, nothing happens, and there are no error messages either. Here is a simple test example.
[2007]$ read ZZ < test.txt && unum "${ZZ}"
Octal Decimal Hex HTML Character Unicode
0101 65 0x41 A "A" LATIN CAPITAL LETTER A
040 32 0x20   " " SPACE, SP
0341 225 0xE1 á "á" LATIN SMALL LETTER A WITH ACUTE
040 32 0x20   " " SPACE, SP
0334 220 0xDC Ü "Ü" LATIN CAPITAL LETTER U WITH DIAERESIS
040 32 0x20   " " SPACE, SP
0321 209 0xD1 Ñ "Ñ" LATIN CAPITAL LETTER N WITH TILDE
040 32 0x20   " " SPACE, SP
040 32 0x20   " " SPACE, SP
062745 26085 0x65E5 日 "日" CJK UNIFIED IDEOGRAPH-#65E5, IRGKangXi=0489.010, RSKangXi=72.0, Def{sun; day; daytime}
063454 26412 0x672C 本 "本" CJK UNIFIED IDEOGRAPH-#672C, IRGKangXi=0509.070, RSKangXi=75.1, Def{root, origin, source; basis}
0105236 35486 0x8A9E 語 "語" CJK UNIFIED IDEOGRAPH-#8A9E, IRGKangXi=1163.080, RSKangXi=149.7, Def{language, words; saying, expression}
040 32 0x20   " " SPACE, SP
061 49 0x31 1 "1" DIGIT ONE
040 32 0x20   " " SPACE, SP
040 32 0x20   " " SPACE, SP
0177421 65297 0xFF11 1 "1" FULLWIDTH DIGIT ONE
040 32 0x20   " " SPACE, SP
040 32 0x20   " " SPACE, SP
057 47 0x2F / "/" SOLIDUS
040 32 0x20   " " SPACE, SP
040 32 0x20   " " SPACE, SP
0137 95 0x5F _,_ "_" LOW LINE
Now, when I try to replace one character as a test using a perl one-liner, e.g.
[2008]$ perl -p -e 's/\N{LATIN CAPITAL LETTER U WITH DIAERESIS}+/X/gu;' test.txt
A á Ü Ñ 日本語 1 1 / _
There's no error, but no substitution either. I also tried,
[2013]$ perl -e 'BEGIN { use charnames q{:full}; }' -p -e 's/\N{LATIN CAPITAL LETTER U WITH DIAERESIS}+/X/gu;' test.txt
A á Ü Ñ 日本語 1 1 / _
with no change. What am I missing? The documentation seems to imply that this should work.
If I make a direct substitution it works as expected,
[2015]$ perl -p -e 's/日+/X/gu;' test.txt
A á Ü Ñ X本語 1 1 / _
You have to tell perl that the input is UTF-8, and that standard output is UTF-8 too (the latter can be skipped, but you'll get a warning). Otherwise perl reads the file as raw bytes, so the UTF-8 bytes for Ü never form the single character U+00DC that \N{...} stands for. The literal 日 only worked because the pattern and the input were both treated as the same sequence of bytes.
In a one-liner, the argument to the -C command-line option controls what is treated as UTF-8: D makes UTF-8 the default encoding for PerlIO channels that are opened (for both reading and writing; there are separate options for just reading or just writing, see perlrun for details), and S says that all the standard streams (input, output and error) are UTF-8 encoded.
So...
$ perl -CDS -wpe 's/\N{LATIN CAPITAL LETTER U WITH DIAERESIS}+/X/gu;' test.txt
A á X Ñ 日本語 1 1 / _