Search code examples
greplocale

correct locale setting for devnagari unicode text


The following output is wrong. There should be only 1 word returned insted of 2

$ echo 'उद्योजकता' |  grep -o -E '\w+'
उद
योजकता

I have been told this is due to locale setting. I have checked it on 2 different servers with 2 different O/S and the results are the same.

Ubuntu

$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

AWS EC2

# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I am not sure which locale setting should be selected to get the Devnagari unicode text to break only at space.


Solution

  • Edited to add: You can use something like this instead if the grep on your machine supports Perl-compatible regular expressions (PCRE). This should be the case e.g., on Amazon Linux.

    echo 'उद्योजकता' |grep -o -P '[\w\pL\pM]+'
    

    which will match "word" characters OR "Letter" code points OR "Mark" code points,

    OR

    echo 'उद्योजकता' |grep -o -P '(*UCP)[\w\pM]+'
    

    which will enable unicode character property (UCP) matching so that \w matches all letters and numbers no matter the script, but you must still include \pM in the pattern because \w simply does not match "Mark" points in grep -- they are not alphanumeric code points.

    Be careful with the above! I don't know Devanagari script so I don't know if it's appropriate or not to consider all such "Mark" characters as being part of a word for your purposes. It might be that a narrower set such as \Mn (for non-spacing marks) is more appropriate for your needs, or perhaps there are only a few specific points to include in which case you'd need to select them individually in your pattern.


    In the old days, \w just meant [A-Za-z0-9_]. It was oriented around ASCII, and C code. Today, in typical interpretations, it still means "alphanumeric and underscore", which can vary depending on locale.

    You say that the output is "wrong", but I'm afraid "wrong" is dependent on which regular expression engine you are using. So even though you are using "grep", the question is which grep, on which OS, etc., etc.

    Your input contains 0x094d, as far as I can tell, which is not a "Letter", according to the unicode character definition (at least, not per the link above). It is a "Mark".

    There is a Unicode "document" (recommendation) which includes how an engine might define "\w" to be Unicode-smart, and indeed it suggests to include Mark codepoints in the match. So your expectation is natural in that sense. However, you can see from the same link that there is no way to do this and also to be strictly POSIX-compliant at the same time, which lots of regex engines want to do.

    Wikipedia indicates that there are some engines which support the Unicode property definitions, but in general, grep isn't going to do it. I'm not familiar enough with those engines (ruby, etc) to say exactly how you should attempt the same thing on command line as you are trying to do with grep.