I have a file with this string in a line: "Ávila"
And I want to get this output: "ávila".
The problem is that awk's tolower function only works when the string does not start with an accented character, and I must use awk.
For example, if I run awk 'BEGIN { print tolower("Ávila") }' then I get "Ávila" instead of the "ávila" that I expect.
But if I run awk 'BEGIN { print tolower("Castellón") }' then I get "castellón".
For a given awk implementation to work properly with non-ASCII characters (foreign letters), it must respect the active locale's character encoding, as reflected in the (effective) LC_CTYPE setting (run locale to see it).
These days, most locales use UTF-8 encoding, a multi-byte-on-demand encoding that is single-byte in the ASCII range, and uses 2 to 4 bytes to represent all other Unicode characters.
Thus, for a given awk implementation to recognize non-ASCII (accented, foreign) letters, it must be able to recognize multiple bytes as a single character.
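To make the multi-byte point concrete, here is a small sketch that prints the bytes of a string as hex. The function name utf8_bytes is hypothetical, and the example assumes the script itself is saved as UTF-8.

```shell
# Hypothetical helper: print the bytes of a string as space-separated hex.
utf8_bytes() {
  # The unquoted command substitution lets the shell collapse od's padding.
  echo $(printf '%s' "$1" | od -An -tx1)
}

utf8_bytes 'A'   # ASCII letter: one byte        -> 41
utf8_bytes 'Á'   # U+00C1 in UTF-8: two bytes    -> c3 81
```

An awk that processes input byte by byte sees two separate "characters" where a UTF-8-aware one sees the single letter Á.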
Among the major awk implementations,
- GNU Awk (gawk), the default on some Linux distros
- BSD Awk (awk), as also used on OS X
- Mawk (mawk), the default on Debian-based Linux distros such as Ubuntu
only GNU Awk properly handles UTF-8-encoded characters (and presumably any other encoding if specified in the locale):
$ echo ÁvilA | gawk '{print tolower($0)}'
ávila # both Á and A lowercased
Conversely, if you expressly want to limit character processing to ASCII only, prepend LC_CTYPE=C:
$ echo ÁvilA | LC_CTYPE=C gawk '{print tolower($0)}'
Ávila # only ASCII char. A lowercased
Practical advice:
- To determine what implementation your default awk is, run awk --version. Mawk will report an error instead, because it only supports -W version, but that error message will contain the word mawk.
- If possible, install and use GNU Awk (and optionally make it the default awk); it is available for most Unix-like platforms; e.g.:
  On Debian-based distros such as Ubuntu: sudo apt-get install gawk
  On OS X, using Homebrew: brew install gawk
- If you must use either BSD Awk or Mawk, use the above LC_CTYPE=C approach to ensure that multi-byte UTF-8 characters are at least passed through without modification [1], but foreign letters will NOT be recognized as letters (and thus won't be lowercased, in this case).
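If GNU Awk is not an option and a few specific accented capitals still need lowercasing, one possible fallback is to combine the LC_CTYPE=C pass-through with explicit substitutions. This is only a sketch: the function name lower_with_map and the choice of letters are assumptions, it covers nothing beyond the letters listed, and it assumes the script and the input are both UTF-8.

```shell
# Hypothetical helper: lowercase ASCII with tolower(), then map a fixed set
# of accented capitals by hand. LC_ALL=C makes awk treat input as raw bytes,
# so each accented letter is matched as its literal UTF-8 byte sequence.
lower_with_map() {
  LC_ALL=C awk '{
    s = tolower($0)          # lowercases A-Z only in the C locale
    gsub(/Á/, "á", s)
    gsub(/É/, "é", s)
    gsub(/Í/, "í", s)
    gsub(/Ó/, "ó", s)
    gsub(/Ú/, "ú", s)
    gsub(/Ñ/, "ñ", s)
    print s
  }'
}

echo 'ÁvilA' | lower_with_map
```

The table has to be extended by hand for every letter you care about, which is why using GNU Awk remains the simpler route where it is available.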
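[1] BSD Awk and Mawk on OS X (the latter curiously not on Linux) treat UTF-8-encoded characters as follows: each byte is processed on its own, and if the byte's value with the high bit dropped falls in the range of ASCII uppercase letters, 32 is added to the original byte value to obtain the lowercase counterpart.
In the case at hand, this means: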
Á is Unicode codepoint U+00C1, whose UTF-8 encoding is the 2-byte sequence: 0xC3 0x81.
- 0xC3: Dropping the high bit (0xC3 & 0x7F) yields 0x43, which is interpreted as ASCII letter C, and 32 (0x20) is therefore added to the original value, yielding 0xE3 (0xC3 + 0x20).
- 0x81: Dropping the high bit (0x81 & 0x7F) yields 0x01, which is not in the range of ASCII uppercase letters (65-90, 0x41-0x5A), so the byte is left as-is.
Effectively, the first byte is modified from 0xC3 to 0xE3, while the 2nd byte is left untouched; since the resulting sequence 0xE3 0x81 is not a properly UTF-8-encoded character, the terminal will print ? instead to signal that.
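The arithmetic above can be reproduced with plain shell arithmetic. This is a sketch of the behavior as described in this footnote, not code taken from either awk implementation; the function name buggy_tolower_byte is hypothetical.

```shell
# Hypothetical helper: mimic the byte-wise lowercasing described above.
# Takes one byte value (e.g. 0xC3) and prints the resulting byte as hex.
buggy_tolower_byte() {
  b=$(( $1 ))
  low7=$(( b & 0x7F ))                 # drop the high bit
  # If the 7-bit value looks like an ASCII uppercase letter (65-90),
  # add 32 to the ORIGINAL byte value.
  if [ "$low7" -ge 65 ] && [ "$low7" -le 90 ]; then
    b=$(( b + 32 ))
  fi
  printf '0x%02X\n' "$b"
}

buggy_tolower_byte 0xC3   # first byte of Á   -> 0xE3
buggy_tolower_byte 0x81   # second byte of Á  -> 0x81
buggy_tolower_byte 0x41   # plain ASCII A     -> 0x61
```

The first two results together give 0xE3 0x81, the invalid sequence that the terminal renders as ?.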