bash awk unicode encoding

How to reproduce awk warning "Invalid multibyte data detected"?


I'm getting this warning from awk when it processes a long file on a server:

Invalid multibyte data detected.
There may be a mismatch between your data and your locale

Now I want to reproduce the message with test data on my laptop:

$ printf '\xC0\x80' | LC_ALL=en_US.UTF-8 awk '{print length}'
2

It doesn't work: I don't get the warning. What data should I pass to awk in order to see it?

This is what I see on my laptop:

$ locale
LANG=""
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

This is what I see on the server:

# locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

By the way, on the server this code reproduces the problem:

# printf '\xC0\x80' | awk '{print length}'
awk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
2

How can I make it print this warning on my laptop too?


Solution

  • On my Mac I can reproduce the problem using:

    printf '\xC0\x80' | LC_ALL="en_US.UTF-8" gawk '{print length}'
    gawk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale
    

    And also:

    echo -e '\xC0\x80' | LC_ALL= gawk '{print length}'
    gawk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale
    

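    But I'm not able to trigger the warning using the macOS 'system' awk (awk version 20200816; not GNU AWK) with the above commands, so I guess your problem is caused by different awk implementations on your laptop and on the server. The "cmd. line:1: (FILENAME=- FNR=1) warning:" prefix in your server output looks like gawk's, so the server's awk is most likely gawk, while the awk on your laptop probably is not.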
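
To check which implementation each machine actually runs, something like the following may help. It is only a rough sketch: `awk_flavor` is a made-up helper name, and it assumes that only gawk prints "GNU Awk" in its `--version` output, so every other implementation (or a missing binary) is reported as `not-gawk`. The last line dumps the test bytes with `od`, using octal escapes because `\x` escapes are not supported by every `printf`.

```shell
# Hypothetical helper: report whether a command name resolves to GNU awk.
# Anything that does not identify itself as "GNU Awk" is lumped together.
awk_flavor() {
    cmd=${1:-awk}
    # stdin comes from /dev/null so an awk that ignores --version cannot block
    if "$cmd" --version 2>/dev/null </dev/null | grep -q 'GNU Awk'; then
        echo gawk
    else
        echo not-gawk
    fi
}

# Which awk does the plain "awk" command run on this machine?
awk_flavor awk

# Confirm the test data really is the two bytes C0 80 (octal 300 200),
# followed by a newline.
printf '\300\200\n' | od -An -tx1
```

If this prints `not-gawk` on the laptop and `gawk` on the server, running the test with `gawk` explicitly, as in the commands above, should be the way to see the warning locally.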