bash, character-encoding

bash uses wrong character length for multi-byte characters


Sometimes I make a typo in a bash command line, as in:

> ls foo.cö
ls: cannot access 'foo.c'$'\303\266': No such file or directory

I then press <up> to recall that command line from the history and edit it, i.e. remove the ö by pressing backspace once. The ö disappears from the display, but I get:

> ls foo.c
ls: cannot access 'foo.c'$'\303': No such file or directory

Pressing backspace twice does work, although the command line then looks wrong even though the result is correct:

> ls foo.
foo.c

Is there a way to make backspace delete the whole umlaut, not only half of its encoding?

It would also be nice if the diagnostic printed the argument as typed, like ls: cannot access 'foo.cö': No such file or directory.


FYI, locale prints

LANG=C
LANGUAGE=de_DE
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=

Exporting LANGUAGE=de_DE.UTF-8 has no effect, and I don't want to set other variables to other values because I don't know what unwanted side effects that might have.


Solution

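  • The C locale is not multi-byte aware: bash treats every byte as one character, so the two-byte UTF-8 encoding of ö (\303\266 in the error message) needs two backspaces. Use one of the UTF-8 locales such as C.UTF-8 or en_US.UTF-8, for example LANG='C.UTF-8'. LANGUAGE only selects the language of translated messages and does not affect character handling, which is why exporting it had no effect.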
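
If you would rather not change LANG as a whole, a narrower option may be to set only LC_CTYPE, which governs character handling, and leave the other categories at "C". The following is a minimal sketch, not a definitive recipe: byte_len is a hypothetical helper name, and C.UTF-8 is an assumed locale name that may be spelled differently or be missing on your system (check locale -a). It also shows why one ö counted as two characters in the C locale.

```bash
# The two bytes that ls printed as \303\266 are the UTF-8 encoding of ö.
umlaut=$(printf '\303\266')

# byte_len (hypothetical helper): force the C locale inside the function,
# so that every byte counts as one character.
byte_len() {
  local LC_ALL=C
  printf '%s\n' "${#1}"
}

byte_len "$umlaut"        # prints 2
byte_len "foo.c$umlaut"   # prints 7

# Switch only character handling to UTF-8, and only if the (assumed)
# locale name is actually installed; everything else stays at "C".
if locale -a 2>/dev/null | grep -qiE '^C\.utf-?8$'; then
  export LC_CTYPE=C.UTF-8
fi
```

To make the setting permanent, the export line could go into ~/.bashrc. Whether ${#umlaut} then reports 1 in an interactive shell depends on the locale actually being available.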