perl unicode grep

command line filtering of Unicode block


I've been trying for a couple hours to create a conceptually trivial filter that I can use on the command line, without success. The task is to filter out all lines containing Hangul Jamo characters, while retaining all other lines (which may contain ASCII, characters in the Hangul Syllable block, etc.).

So for example if the input was

 foo
 ᅤᆨ
 간

the output would contain the first and third lines, but not the second, since the second line contains Jamo characters. (The above is not meant to be real Korean, just a simple test case.)

I'm very disappointed with the GNU grep utility (version 2.20). I would have thought the following would work:

grep -Pv '[\x{1100}-\x{11FF}]'

but instead I get the error message "grep: character value in \x{...} sequence is too large". (The \u1100 syntax, which I assumed was the actual Perl syntax, simply isn't supported.)

(I do notice that our version 2.20 is rather old. If someone tries the above with a newer version of grep, and it works, I'll certainly consider that an answer--and I'll get our IT folks to upgrade!)

I tried sed, but didn't get any further. (Sorry, I don't remember exactly what sed commands I tried, but sed's support for Unicode blocks doesn't seem any better than grep's.)

Finally, I tried perl (v5.16.3):

 perl -ne 'print unless /[\u1100-\u11ff]/'

This at least succeeds in eliminating the Jamo lines while retaining the Hangul Syllable lines, but it also eliminates the ASCII lines, which I don't want to do. I also would have thought one of the following would work:

perl -ne 'print unless /\p{InHangul_Jamo}/'
perl -ne 'print unless /\p{Block: Hangul_Jamo}/'

but neither appears to have any effect. (AFAIK, I shouldn't need a .* on each side of the \p{...}, but I tried that too; no luck.)

Locale: in case it matters, I have LANG=en_US.UTF-8.

I'm sure I could do this in Python, but I'd like to understand why neither grep nor perl seems to work, because they'd be a lot simpler. (And if I'm right about the GNU utilities having poor Unicode support, why that is...and when it will be fixed. It's not like Unicode is new!) Of course I realize the problem may be that I'm not holding my mouth right when I try, but if so, it would be nice for grep at least to have better documentation on Unicode usage. Right now the documentation for grep -P says "This is highly experimental and grep -P may warn of unimplemented features." And it seems to have been that way roughly forever.


Solution

  • Decode inputs, encode outputs. If the encoding in question is UTF-8, the command-line switch -CSD will come in handy: S treats STDIN, STDOUT and STDERR as UTF-8, and D makes UTF-8 the default for other file handles.

    perl -CSD -ne'print if !/\p{Block: Hangul_Jamo}/'
    perl -CSD -ne'print if !/\p{Block: Jamo}/'
    perl -CSD -ne'print if !/\p{Blk=Jamo}/'
    perl -CSD -ne'print if !/\p{InJamo}/'
    perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}]/'
    perl -CSD -ne'print if !/[\x{1100}-\x{11FF}]/'
    grep -vP '[\x{1100}-\x{11FF}]'
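
    To see what decoding changes, here is a small sketch built on the sample from your question. It assumes a POSIX shell and that PERL_UNICODE is not set in the environment; the variable names are only for illustration.

    ```shell
    # U+1100 is three UTF-8 bytes. Without -C, Perl sees three characters.
    raw_len=$(perl -CSD -e 'print "\x{1100}"' | perl -ne 'print length')
    # With -CSD, the input is decoded first, so Perl sees one character.
    decoded_len=$(perl -CSD -e 'print "\x{1100}"' | perl -CSD -ne 'print length')
    echo "raw: $raw_len, decoded: $decoded_len"

    # The question's sample: ASCII, Hangul Jamo (U+1164 U+11A8), Hangul Syllable (U+AC04).
    filtered=$(perl -CSD -e 'print "foo\n\x{1164}\x{11A8}\n\x{AC04}\n"' |
      perl -CSD -ne 'print if !/\p{Block: Hangul_Jamo}/')
    printf '%s\n' "$filtered"
    ```

    The first line of output should read "raw: 3, decoded: 1", followed by the foo line and the Hangul Syllable line.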
    

    You might want to add the Hangul_Jamo_Extended_A, Hangul_Jamo_Extended_B and Hangul_Compatibility_Jamo blocks.

    perl -CSD -ne'print if !/[\p{Block: Hangul_Jamo}\p{Block: Hangul_Jamo_Extended_A}\p{Block: Hangul_Jamo_Extended_B}\p{Block: Hangul_Compatibility_Jamo}]/'
    perl -CSD -ne'print if !/[\p{Block: Jamo}\p{Block: JamoExtA}\p{Block: JamoExtB}\p{Block: CompatJamo}]/'
    perl -CSD -ne'print if !/[\p{Blk=Jamo}\p{Blk=JamoExtA}\p{Blk=JamoExtB}\p{Blk=CompatJamo}]/'
    perl -CSD -ne'print if !/[\p{InJamo}\p{InJamoExtA}\p{InJamoExtB}\p{InCompatJamo}]/'
    perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}\N{U+A960}-\N{U+A97F}\N{U+D7B0}-\N{U+D7FF}\N{U+3130}-\N{U+318F}]/'
    perl -CSD -ne'print if !/[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]/'
    grep -vP '[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]'
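
    As a rough illustration of why the extra blocks can matter, the sketch below feeds in one line from the Hangul Compatibility Jamo block (U+3131) next to one from the basic Hangul Jamo block (U+1100). The input and the variable names are made up for the example.

    ```shell
    # Lines: ASCII, U+3131 (Compatibility Jamo), U+1100 (Jamo), U+AC04 (Syllable).
    input=$(perl -CSD -e 'print "foo\n\x{3131}\n\x{1100}\n\x{AC04}\n"')

    # Basic block only: the U+3131 line survives.
    basic=$(printf '%s\n' "$input" |
      perl -CSD -ne 'print if !/[\x{1100}-\x{11FF}]/')

    # All four blocks: both Jamo lines are dropped.
    all_blocks=$(printf '%s\n' "$input" |
      perl -CSD -ne 'print if !/[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]/')

    printf 'basic:\n%s\nall blocks:\n%s\n' "$basic" "$all_blocks"
    ```

    With the basic range, three lines remain; with all four ranges, only the ASCII and Hangul Syllable lines remain.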
    

    Let's look at your failed attempts.

    • grep -Pv '[\x{1100}-\x{11FF}]'

      Actually, this one should work, and it does for me.

      $ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | od -t x1
      0000000 61 62 63 0a 64 e1 84 80 66 0a 67 68 69 0a
      0000016
      
      $ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | grep -Pv '[\x{1100}-\x{11FF}]'
      abc
      ghi
      
      $ grep --version | head -1
      grep (GNU grep) 2.16
      

      I do get your error on an older machine with grep (GNU grep) 2.10.

    • perl -ne'print unless /\p{Block: Hangul_Jamo}/'

      You didn't get any matches from /\p{Block: Hangul_Jamo}/ because you were matching against encoded text (UTF-8 bytes, chars in the range 00..FF) instead of decoded text (Unicode Code Points, chars in the range 00000..10FFFF).

    • perl -ne 'print unless /\p{InHangul_Jamo}/'

      \p{Block: X}, \p{Blk=X} and \p{InX} are equivalent, so this fails for the same reason: the input was never decoded.

    • perl -ne'print unless /[\x{1100}-\x{11FF}]/'

      [\x{1100}-\x{11FF}] is equivalent to \p{Block: Hangul_Jamo}, so without decoding it fails in the same way.

    • perl -ne'print unless /[\u1100-\u11ff]/'

      You got too many matches since \u in double-quoted string literals and in regex pattern literals titlecases the next character. (e.g. "\uxyz" is equivalent to "Xyz".)

      As such, [\u1100-\u11ff] is equivalent to [01f], which is why the line foo (it contains an f) was dropped.
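
    The \u behaviour is easy to check from the shell. The following is a minimal sketch, assuming a POSIX shell; the inputs and variable names are invented for the demonstration.

    ```shell
    # \u titlecases the next character; it is not a code point escape.
    titled=$(perl -e 'print "\uxyz"')

    # So [\u1100-\u11ff] acts as the class [01f]: it matches the "f" in "foo"
    # but nothing in "bar".
    matched=$(printf 'foo\nbar\n' | perl -ne 'print if /[\u1100-\u11ff]/')

    # A real code point range needs \x{...} or \N{U+...}, plus decoded input.
    kept=$(perl -CSD -e 'print "foo\n\x{1100}\n"' |
      perl -CSD -ne 'print unless /[\N{U+1100}-\N{U+11FF}]/')

    echo "titled=$titled matched=$matched kept=$kept"
    ```

    This should print titled=Xyz matched=foo kept=foo: the \u pattern selects the ASCII line, while the \N{U+...} range with -CSD keeps it and drops only the Jamo line.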