
grepping for Russian characters using character ranges


How do I grep for lines with 'Й' and 'й' in a text file, using character ranges?

In Unicode, Russian capital characters (except 'Ё') are in the range from 0x410 to 0x42f in alphabetical order, and small characters (except 'ё') are in the range from 0x430 to 0x44f in alphabetical order. This means that [А-ИК-ЯЁ] should match all Russian characters except 'Й', and [а-ик-яё] should match all Russian characters except 'й'. But this turns out to be not quite the case.
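The code points quoted above can be double-checked with a small sketch like the one below. It assumes `iconv` and `od` are installed; `codepoint` is just a helper name made up for this illustration. Because the encodings are named explicitly, the result does not depend on the current locale.

```shell
# Hypothetical helper: print the Unicode code point (hex) of one UTF-8 character.
codepoint () {
    # Convert to big-endian UTF-32, dump the four bytes as hex,
    # join them into one string, and drop the leading zeros of BMP characters.
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -tx1 |
        tr -d ' \n' | sed 's/^0000//'
}

for ch in А Й Я а й я Ё ё; do
    printf '%s U+%s\n' "$ch" "$(codepoint "$ch")"
done
# Prints:
# А U+0410
# Й U+0419
# Я U+042f
# а U+0430
# й U+0439
# я U+044f
# Ё U+0401
# ё U+0451
```

This confirms that 'Ё' and 'ё' sit outside the two contiguous ranges, which is why they are listed separately in the bracket expressions.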

For experimenting, I created a function that outputs Russian characters [Ж-Мж-м], one per line:

rus () { for char in Ж З И Й К Л М ж з и й к л м; do echo $char; done; }

I also exported the appropriate collate setting:

export LC_COLLATE=ru_RU.UTF-8

Without character ranges everything worked as expected:

rus | grep -v "[АБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]"

and

rus | grep -v "[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзиклмнопрстуфхцчшщъыьэюя]"

output 'Й' and 'й' respectively.

With character ranges, [А-ИК-ЯЁа-ик-яё] should match all Russian characters except 'Й' and 'й', and this turned out to be correct. But when I wanted to filter only 'Й' or only 'й', something interesting happened: on my system, both

rus | grep -v "[А-ЯЁа-ик-яё]"  # expected output: 'й'

and

rus | grep -v "[А-ИК-ЯЁа-яё]"  # expected output: 'Й'

output nothing!

'Й' and 'й' are not special in this respect; the analogous experiment with the letters 'П' and 'п' showed the same effect.

Is grep maybe, for some reason, handling Russian or Cyrillic characters case-insensitively by default in character ranges? No, it is not: adding --no-ignore-case to all those grep commands changed nothing.

What's going on? Have I found a bug in grep? Or am I missing something?

(I am using GNU grep 3.11 (built with pcre), and bash 5.1.16.)


Solution

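  • First, you should quote the argument to grep. To the shell, an unquoted bracket expression is a glob pattern, so if the current directory contains a file whose name is a single Russian letter matching it, the shell replaces the pattern with that file name and that letter is the only thing passed to grep.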

    But the problem is that grep without PCRE appears to work bytewise, regardless of locale settings. So I think you need to turn on Perl-compatible mode with -P:

    $ rus | grep -Pv '[А-ЯЁа-ик-яё]'
    й
    

    Whenever you suspect problems interpreting the argument to grep, a good sanity check is to fall back to sending it pure-ASCII strings, using the \x{...} syntax for non-ASCII characters (which is also a feature of PCRE, so only works with -P):

    $ rus | grep -Pv '[\x{0410}-\x{042f}\x{0401}\x{0430}-\x{0438}\x{043a}-\x{044f}\x{0451}]'
    й
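If looking up the code points by hand gets tedious, something along these lines could generate the escapes for you. This is only a sketch: `to_pcre_hex` is a hypothetical helper name, and it assumes `iconv`, `od` and `sed` are available.

```shell
# Hypothetical helper: rewrite every character of a UTF-8 string
# as a PCRE \x{...} escape, so the pattern can be passed as pure ASCII.
to_pcre_hex () {
    # UTF-32BE uses exactly four bytes (eight hex digits) per character;
    # od -v keeps repeated lines instead of collapsing them to '*'.
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -v -tx1 |
        tr -d ' \n' | sed 's/......../\\x{&}/g'
}

to_pcre_hex 'Ёё'; echo
# Prints: \x{00000401}\x{00000451}
```

The escapes come out zero-padded to eight hex digits. As far as I know PCRE treats the leading zeros as insignificant, so `\x{00000401}` should behave like `\x{0401}`, but trim them if your grep complains. Range hyphens still have to be written by hand between the generated escapes.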