
Awk tolower a string that starts with an accent - support for foreign characters


I have a file with this string in a line: "Ávila"

And I want to get this output: "ávila".

The problem is that awk's tolower function only works when the string does not start with an accented letter, and I must use awk.

For example, if I run awk 'BEGIN { print tolower("Ávila") }', I get "Ávila" instead of the "ávila" that I expect.

But if I run awk 'BEGIN { print tolower("Castellón") }', I get "castellón", as expected.


Solution

  • For a given awk implementation to work properly with non-ASCII characters (foreign letters), it must respect the active locale's character encoding, as reflected in the (effective) LC_CTYPE setting (run locale to see it).

    These days, most locales use UTF-8 encoding, a multi-byte-on-demand encoding that is single-byte in the ASCII range, and uses 2 to 4 bytes to represent all other Unicode characters.
    Thus, for a given awk implementation to recognize non-ASCII (accented, foreign) letters, it must be able to recognize multiple bytes as a single character.
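
The multi-byte representation is easy to inspect from the shell. The following is a minimal sketch, assuming the script and terminal use UTF-8; `hex_bytes` is a hypothetical helper name introduced here for illustration, built on the standard `od` and `tr` utilities.

```shell
# hex_bytes: hypothetical helper that prints the bytes of its argument in hex.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

hex_bytes 'A'   # ASCII letter: a single byte (41)
hex_bytes 'Á'   # accented letter: two bytes in UTF-8 (c381)
```

An awk that processes input byte by byte sees the second string as two separate "characters", neither of which is a letter it knows.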

    Among the major awk implementations,

    • GNU Awk (gawk), the default on some Linux distros
    • BSD awk, as also used on OS X
    • Mawk (mawk), the default on Debian-based Linux distros such as Ubuntu

    only GNU Awk properly handles UTF-8-encoded characters (and presumably any other encoding specified in the locale):

    $ echo ÁvilA | gawk '{print tolower($0)}'
    ávila  # both Á and A lowercased
    

    Conversely, if you expressly want to limit character processing to ASCII only, prepend LC_CTYPE=C:

    $ echo ÁvilA | LC_CTYPE=C gawk '{print tolower($0)}'
    Ávila  # only ASCII char. A lowercased
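
If only a handful of accented letters matter, byte-oriented processing can be combined with an explicit mapping. This is a workaround sketch rather than something the answer itself proposes: it assumes UTF-8 input and a UTF-8 script file, `lower_with_map` is a hypothetical name, and the list of letters is illustrative, not complete.

```shell
# lower_with_map: hypothetical filter that lowercases ASCII letters via tolower()
# and a fixed set of accented letters via explicit substitutions.
lower_with_map() {
  LC_ALL=C awk '{
    s = tolower($0)      # in the C locale only ASCII letters are changed
    gsub(/Á/, "á", s)    # each accented letter is matched as a literal byte sequence
    gsub(/É/, "é", s)
    gsub(/Í/, "í", s)
    gsub(/Ó/, "ó", s)
    gsub(/Ú/, "ú", s)
    gsub(/Ñ/, "ñ", s)
    print s
  }'
}

echo 'ÁvilA' | lower_with_map
```

Because every substitution is spelled out, the approach does not depend on the awk implementation understanding UTF-8, but any letter missing from the list is passed through unchanged.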
    

    Practical advice:

    • To determine what implementation your default awk is, run awk --version.

      • In the case of Mawk you'll get an error message, because it only supports printing version information with -W version, but that error message will contain the word mawk.
    • If possible, install and use GNU Awk (and optionally make it the default awk); it is available for most Unix-like platforms; e.g.:

      • On Debian-based platforms such as Ubuntu: sudo apt-get install gawk
      • On OS X, using Homebrew: brew install gawk.
    • If you must use either BSD Awk or Mawk, use the above LC_CTYPE=C approach to ensure that the multi-byte UTF-8 characters are at least passed through without modification [1]. Note, however, that foreign letters will NOT be recognized as letters (and thus won't be lowercased, in this case).
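
The implementation check from the advice above can be scripted. The sketch below is only an illustration: `classify_awk` is a hypothetical helper name, and it assumes that GNU Awk's version banner contains "GNU Awk" and that Mawk's output (banner or error message) contains "mawk". Anything else is reported as "other".

```shell
# classify_awk: hypothetical helper that guesses the implementation from one banner line.
classify_awk() {
  case "$1" in
    *"GNU Awk"*) echo gawk ;;
    *mawk*)      echo mawk ;;
    *)           echo other ;;
  esac
}

# 2>&1 also captures the error message that Mawk may print for --version;
# stdin is redirected so awk never waits for input.
banner=$(awk --version 2>&1 </dev/null | head -n 1)
classify_awk "$banner"
```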


    [1] BSD Awk and Mawk on OS X (the latter curiously not on Linux) treat UTF-8-encoded characters as follows:

    • Each byte is mistakenly interpreted as its own character.
    • If, after ignoring the high bit, the resulting byte value falls into the range of ASCII uppercase letters, 32 is added to the original byte value to obtain the lowercase counterpart.

    In the case at hand, this means:

    • Á is Unicode codepoint U+00C1, whose UTF-8 encoding is the 2-byte sequence: 0xC3 0x81.

    • 0xC3: Dropping the high bit (0xC3 & 0x7F) yields 0x43, which is interpreted as ASCII letter C, and 32 (0x20) is therefore added to the original value, yielding 0xE3 (0xC3 + 0x20).

    • 0x81: Dropping the high bit (0x81 & 0x7F) yields 0x01, which is not in the range of ASCII uppercase letters (65-90, 0x41-0x5A), so the byte is left as-is.

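    • Effectively, the first byte is modified from 0xC3 to 0xE3, while the 2nd byte is left untouched; since the resulting 0xE3 0x81 is not a valid UTF-8 sequence (0xE3 announces a 3-byte character, but the byte after 0x81 is the ASCII letter v rather than a continuation byte), the terminal will print ? instead to signal that.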
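
The arithmetic in this footnote can be reproduced with plain shell arithmetic. This is a sketch of the behavior as described above, not code taken from BSD Awk or Mawk; `buggy_lower_byte` is a hypothetical name.

```shell
# buggy_lower_byte: hypothetical helper that mimics the byte-wise lowercasing
# described in the footnote for a single byte value.
buggy_lower_byte() {
  byte=$(( $1 ))
  low7=$(( byte & 0x7F ))                  # ignore the high bit
  if [ "$low7" -ge 65 ] && [ "$low7" -le 90 ]; then
    byte=$(( byte + 32 ))                  # looks like A-Z, so 32 is added to the original byte
  fi
  printf '0x%02X\n' "$byte"
}

buggy_lower_byte 0xC3   # first byte of Á: prints 0xE3
buggy_lower_byte 0x81   # second byte of Á: prints 0x81 (unchanged)
```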