I'm getting this warning from awk when it processes a long file on a server:
Invalid multibyte data detected.
There may be a mismatch between your data and your locale
Now I want to reproduce the message with test data on my laptop:
$ printf '\xC0\x80' | LC_ALL=en_US.UTF-8 awk '{print length}'
2
That doesn't work: I don't get the warning. What data do I need to pass to awk in order to see it?
This is what I see on my laptop:
$ locale
LANG=""
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
This is what I see on the server:
# locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
By the way, on the server this code reproduces the problem:
# printf '\xC0\x80' | awk '{print length}'
awk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
2
How can I make it print this warning on my laptop too?
On my Mac I can reproduce the problem using:
printf '\xC0\x80' | LC_ALL="en_US.UTF-8" gawk '{print length}'
gawk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale
And also:
echo -e '\xC0\x80' | LC_ALL= gawk '{print length}'
gawk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale
But I can't trigger the warning with the macOS 'system' awk (awk version 20200816; not GNU AWK) using the above commands, so I guess the difference comes from your laptop and the server running different versions of AWK.
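One way to check that guess is to ask each machine which awk it actually runs. The following is only a rough sketch: the function name awk_flavor is made up for illustration, and it assumes the binary answers either --version (as gawk and the macOS awk do) or -W version (as mawk does). Anything else is reported as unknown.

```shell
#!/bin/sh
# Rough sketch: guess which awk implementation a given binary is.
# awk_flavor is a hypothetical helper name, not part of any awk.
awk_flavor() {
    bin=${1:-awk}   # binary name or path to inspect; defaults to awk

    # gawk and the macOS (BWK) awk both understand --version.
    v=$("$bin" --version 2>/dev/null </dev/null | head -n 1)

    # Older mawk releases only understand -W version.
    [ -n "$v" ] || v=$("$bin" -W version 2>/dev/null </dev/null | head -n 1)

    case $v in
        "GNU Awk"*)     echo gawk ;;
        mawk*)          echo mawk ;;
        "awk version"*) echo bwk-awk ;;
        *)              echo unknown ;;
    esac
}

awk_flavor awk
```

If this prints gawk on the server and bwk-awk on the laptop, that matches the behaviour above: the warning text comes from gawk, so running the test data through gawk on the laptop should show it there too.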