
How to sort words with accents?

I was wondering how to alphabetically sort a list of Spanish words [with accents].

Excerpt from the word list:



  • Cygwin uses GNU utilities, which are usually well-behaved when it comes to locales - a notable and regrettable exception is awk (gawk).

    The following is based on Cygwin 1.7.31-3, current as of this writing.

    • Cygwin by default uses the locale implied by the current Windows user's UI language, combined with UTF-8 character encoding.
      • Note that it's NOT based on the setting for date/time/number/currency formats, and changing that makes no difference. The limitation of basing the locale on the UI language is that it always uses that language's "home" region; e.g., if your UI language is Spanish, Cygwin will use es_ES, i.e., Spain's locale, regardless of where you are. The only way to change that is to explicitly override the default - see below.
    • You can override this in a variety of ways, preferably by defining a persistent Windows environment variable named LANG (see below).

    To see what locale is in effect in Cygwin, run locale and inspect the value of the LANG variable.

    If that doesn't show es_*.utf8 (where * represents your region in the Spanish-speaking world, e.g., CO for Colombia, ES for Spain, ...), set the locale as follows:

    • In Windows, open the Start menu and search for 'environment', then select Edit environment variables for your account, which opens the Environment Variables dialog.
    • Edit or create a variable named LANG with the desired locale, e.g., es_CO.utf8 -- UTF-8 character encoding is usually the best choice.

    Any Cygwin bash shell you open from then on should reflect the new locale - verify by running locale and ensuring that the LC_* values match the LANG value and that no warnings are reported.
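
The verification step above can be scripted. The following is a minimal sketch, not part of the original answer: lang_is_spanish_utf8 and mismatched_categories are hypothetical helper names, and accepting the UTF-8 spelling alongside utf8 is an assumption. It only inspects the text printed by locale; it does not detect warnings written to stderr.

```shell
# lang_is_spanish_utf8: hypothetical helper; succeeds if its argument looks
# like es_<REGION>.utf8. Accepting the UTF-8 spelling as well is an assumption.
lang_is_spanish_utf8() {
  case "$1" in
    es_*.utf8 | es_*.UTF-8) return 0 ;;
    *) return 1 ;;
  esac
}

# mismatched_categories: hypothetical helper; reads `locale` output on stdin
# and prints every LC_* category (except LC_ALL) whose value differs from LANG.
# Returns non-zero if at least one mismatch was found.
mismatched_categories() {
  awk -F= '
    { gsub(/"/, "", $2) }
    $1 == "LANG"   { lang = $2; next }
    $1 == "LC_ALL" { next }
    $1 ~ /^LC_/    { value[$1] = $2 }
    END {
      for (name in value)
        if (value[name] != lang) { print name "=" value[name]; bad = 1 }
      exit bad
    }
  '
}

# Report whether the current session looks correctly configured.
if lang_is_spanish_utf8 "${LANG:-}" && locale | mismatched_categories; then
  echo "locale OK: $LANG"
else
  echo "locale needs attention: LANG='${LANG:-}'"
fi
```

awk is used here only to compare ASCII locale names, so its locale limitations mentioned earlier should not matter for this check.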

    At that point, the following:

    sort <<<$'Chocó\nCundinamarca\nCórdoba'

    should produce (i.e., ó will sort directly after o, as desired):

    Chocó
    Córdoba
    Cundinamarca

    Note: locale en_US.utf8 would produce the same output - apparently, it generically sorts accented characters directly after their base characters - which may or may not be what a specific non-US locale actually does.
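
To see why the locale matters at all, it can help to compare a plain byte-order sort with a locale-aware one. This is a minimal sketch under assumptions: words is a hypothetical helper, es_CO.utf8 is just the example locale used above, and overriding LC_ALL for a single command is one of several possible ways to select a locale.

```shell
# words: hypothetical helper that prints the sample names, deliberately out of order.
words() { printf '%s\n' 'Córdoba' 'Cundinamarca' 'Chocó'; }

# Byte-order collation: in UTF-8, ó is encoded as the bytes 0xC3 0xB3, which
# compare higher than any ASCII letter, so Córdoba lands after Cundinamarca.
c_sorted=$(words | LC_ALL=C sort)
printf 'C locale:\n%s\n' "$c_sorted"

# Locale-aware collation. es_CO.utf8 is an assumed example; if that locale
# is not available on the system, sort may fall back to byte order.
es_sorted=$(words | LC_ALL=es_CO.utf8 sort)
printf 'es_CO.utf8:\n%s\n' "$es_sorted"
```

If both listings come out identical, with Córdoba last, the Spanish locale is most likely not being applied.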