How to sort words with accents?


I was wondering how to alphabetically sort a list of Spanish words that contain accents.

Excerpt from the word list:

Chocó
Cundinamarca
Córdoba

Solution

  • Cygwin uses GNU utilities, which are usually well-behaved when it comes to locales; a notable and regrettable exception is awk (gawk).

    The following is based on Cygwin 1.7.31-3, current as of this writing.

    • Cygwin by default uses the locale implied by the current Windows user's UI language, combined with UTF-8 character encoding.
      • Note that it is NOT based on the setting for date/time/number/currency formats; changing that setting makes no difference. The limitation of basing the locale on the UI language is that Cygwin always uses that language's "home" region; e.g., if your UI language is Spanish, Cygwin will always use es_ES, i.e., Spain's locale. The only way to change that is to explicitly override the default (see below).
    • You can override this in a variety of ways, preferably by defining a persistent Windows environment variable named LANG (see below; for an overview of all methods, see https://superuser.com/a/271423/139307)
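For a quick experiment before changing anything persistently, the locale can also be overridden from inside the shell itself. This is a minimal sketch, assuming the es_CO.utf8 locale is available on your system. Note that LC_ALL or LC_COLLATE, if set, take precedence over LANG.

```shell
# One-off override: the locale applies only to this sort invocation.
printf '%s\n' 'Chocó' 'Cundinamarca' 'Córdoba' | LANG=es_CO.utf8 sort

# Session-wide override: applies to every command started from this shell.
# (Adding this line to ~/.bashrc would make it apply to new shells as well.)
export LANG=es_CO.utf8
echo "LANG is now: $LANG"
```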

    To see what locale is in effect in Cygwin, run locale and inspect the value of the LANG variable.
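As a rough sketch, that check can be scripted. The helper name is_spanish_utf8 is hypothetical, and the pattern it tests (es_<REGION>.utf8) is simply the kind of locale this answer aims for.

```shell
# Hypothetical helper: succeeds if a locale name looks like es_<REGION>.utf8
# (the spelling .UTF-8 is accepted as well).
is_spanish_utf8() {
  case "$1" in
    es_*.utf8|es_*.UTF-8) return 0 ;;
    *) return 1 ;;
  esac
}

# Inspect the LANG value of the current shell (empty if unset).
current=${LANG:-}
if is_spanish_utf8 "$current"; then
  echo "LANG is a Spanish UTF-8 locale: $current"
else
  echo "LANG is '$current'; not a Spanish UTF-8 locale"
fi
```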

    If that doesn't show es_*.utf8 (where * represents your region in the Spanish-speaking world, e.g., CO for Colombia, ES for Spain, ...), set the locale as follows:

    • In Windows, open the Start menu and search for 'environment', then select Edit environment variables for your account, which opens the Environment Variables dialog.
    • Edit or create a variable named LANG with the desired locale, e.g., es_CO.utf8; UTF-8 character encoding is usually the best choice.

    Any Cygwin bash shell you open from then on should reflect the new locale. Verify this by running locale and ensuring that the LC_* values match the LANG value and that no warnings are reported.
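The comparison of the LC_* values against LANG can be automated. The following is only a sketch: lc_matches_lang is a hypothetical helper, it assumes the usual locale output format (a LANG= line first, then LC_*="..." lines), and it does not detect warnings, which locale prints to stderr.

```shell
# Hypothetical helper: reads `locale` output on stdin and succeeds only if
# every non-empty LC_* value equals the LANG value.
# The ( ... ) body runs in a subshell, so its variables do not leak.
lc_matches_lang() (
  lang=''
  status=0
  while IFS='=' read -r name value; do
    # Strip the double quotes that locale puts around implied values.
    value=${value#\"}
    value=${value%\"}
    case "$name" in
      LANG) lang=$value ;;
      LC_*)
        # LC_ALL is normally empty; empty values are skipped.
        if [ -n "$value" ] && [ "$value" != "$lang" ]; then
          echo "mismatch: $name=$value (LANG=$lang)" >&2
          status=1
        fi
        ;;
    esac
  done
  exit "$status"
)

# Run the check only where the locale utility is available.
if command -v locale >/dev/null 2>&1; then
  if locale | lc_matches_lang; then
    echo "all LC_* values match LANG"
  fi
fi
```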

    At that point, the following:

    sort <<<$'Chocó\nCundinamarca\nCórdoba'
    

    should produce (i.e., ó will sort directly after o, as desired):

    Chocó
    Córdoba
    Cundinamarca
    

    Note: the en_US.utf8 locale would produce the same output; apparently, it generically sorts accented characters directly after their base characters, which may or may not be what a specific non-US locale actually does.
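For contrast, here is a small sketch of what happens without locale-aware collation. Forcing the C locale makes sort compare raw bytes, and because ó is encoded in UTF-8 as the bytes 0xC3 0xB3, it lands after every ASCII letter. The variable names are illustrative only.

```shell
words=$(printf '%s\n' 'Chocó' 'Cundinamarca' 'Córdoba')

# Byte-wise collation: "Có..." sorts after "Cu..." because 0xC3 > 'u'.
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# Chocó
# Cundinamarca
# Córdoba

# Locale-aware collation, using whatever locale is currently in effect.
printf '%s\n' "$words" | sort
```

If the second command prints the same order as the first, the effective locale is most likely still C/POSIX.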