bash unicode utf-8 sed locale

Locale error when adding a space after each character for a unicode file?


I want to add a space after each character in a text file.

in.txt

在吗??
嗯
你让我看的那款手提是不是11寸的,很小的?
看来还是美国的便宜啊
应该是吧

out.txt

在 吗 ? ?
嗯
你 让 我 看 的 那 款 手 提 是 不 是 1 1 寸 的 , 很 小 的 ?
看 来 还 是 美 国 的 便 宜 啊
应 该 是 吧

I've tried this (How to remove/add spaces in all textfiles?) but it outputs:

� � � � � � � � � � � � 
� � � 
� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 1 1 � � � � � � � � � � � � � � � � � � � � 
� � � � � � � � � � � � � � � � � � � � � � � � � � � � � 
� � � � � � � � � � � � 

How do I achieve out.txt?


I've also tried:

$ perl -F'' -C -lane 'print join " ", @F' in.txt 
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LC_PAPER = "de_DE.UTF-8",
    LC_ADDRESS = "de_DE.UTF-8",
    LC_MONETARY = "de_DE.UTF-8",
    LC_NUMERIC = "de_DE.UTF-8",
    LC_TELEPHONE = "de_DE.UTF-8",
    LC_IDENTIFICATION = "de_DE.UTF-8",
    LC_MEASUREMENT = "de_DE.UTF-8",
    LC_TIME = "de_DE.UTF-8",
    LC_NAME = "de_DE.UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
� � � � � � � � � � � �
� � �
� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 1 1 � � � � � � � � � � � � � � � � � � � �
� � � � � � � � � � � � � � � � � � � � � � � � � � � � �
� � � � � � � � � � � �

And

$ cat in.txt
在吗??
嗯
你让我看的那款手提是不是11寸的,很小的?
看来还是美国的便宜啊
应该是吧
$ sed 's/\s/g;s/./& /g'  in.txt
sed: -e expression #1, char 10: unknown option to `s'

There seems to be something wrong with my locale:

$ locale
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=

To fix it, I had to run:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

And then:

$ perl -F'' -C -lane 'print join " ", @F' in.txt 
在 吗 ? ?
嗯
你 让 我 看 的 那 款 手 提 是 不 是 1 1 寸 的 , 很 小 的 ?
看 来 还 是 美

Solution

  • Assuming you have a UTF-8 locale set up correctly, you can use this Perl one-liner:

    perl -F'' -C -lane 'print join " ", @F' in.txt > out.txt
    

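    The -a switch splits each input line on the field separator, which -F'' sets to an empty pattern, so each character becomes a separate element of the array @F. The -C switch enables UTF-8 on the standard streams and on files Perl opens, but when given without flags it only takes effect if the locale indicates UTF-8, which is why a working locale is needed. Since this uses join, no space is added after the last character on the line (it's not clear from the question whether there should be one).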

    Another option is to use a substitution:

    perl -C -pe 's/(.)/$1 /g' in.txt > out.txt
    

    This will add a space after every character, including the last one.
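
If fixing the locale is not an option, a possible workaround is to pass explicit flags to -C. The sketch below assumes a Perl recent enough to support -CSD, which requests UTF-8 for the standard streams and for opened files without consulting the locale. It also shows a lookahead variant of the substitution that avoids the trailing space. The sample input and the output file names (out_join.txt, out_subst.txt) are only illustrative.

```shell
# Sample input: the first two lines of in.txt from the question.
printf '在吗??\n嗯\n' > in.txt

# -CSD (instead of bare -C) makes Perl treat the standard streams and any
# files it opens as UTF-8, regardless of the locale environment variables.
perl -CSD -F'' -lane 'print join " ", @F' in.txt > out_join.txt

# Substitution variant with a lookahead: a space is added only when another
# character follows on the same line, so no trailing space is produced.
perl -CSD -pe 's/(.)(?=.)/$1 /g' in.txt > out_subst.txt

# Both approaches should agree; cmp is silent when the files are identical.
cmp out_join.txt out_subst.txt && cat out_join.txt
```

With the two sample lines, both commands should produce "在 吗 ? ?" followed by "嗯". The lookahead works because "." does not match the newline, so the last character on each line is never followed by a match.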