(e)grep: accented characters not recognised as part of a word


I would like to use (e)grep to match a whole word using the -w switch. I've set the locale, but accented characters are still treated as word boundaries, as in this example:

$ locale
LANG=es_VE.utf8
LC_CTYPE="es_VE.utf8"
LC_NUMERIC="es_VE.utf8"
LC_TIME="es_VE.utf8"
LC_COLLATE="es_VE.utf8"
LC_MONETARY="es_VE.utf8"
LC_MESSAGES="es_VE.utf8"
LC_ALL=es_VE.utf8

$ echo -e "cáñamo\namo" | egrep -w amo
cáñamo
amo

How can I find amo while ignoring cáñamo?


Solution

  • Which code points count as word characters is defined by Unicode itself, not by the locale, and LATIN SMALL LETTER N WITH TILDE (ñ) is always a word character.

    Here’s an all-UTF-8 workflow in Perl that searches for amo first after a word boundary (\b), and then after a non-boundary (\B):

     $ perl -Mutf8 -CSDA  -e 'print "cáñamo\namo\n"' | 
       perl -Mutf8 -CSDA -ne 'print if /\bamo\b/'
     amo
    
     $ perl -Mutf8 -CSDA  -e 'print "cáñamo\namo\n"' | 
       perl -Mutf8 -CSDA -ne 'print if /\Bamo\b/'
     cáñamo
    

    I cannot help but be amused by your choice of search strings. Thanks for the chuckle.