Tags: shell, utf-8, wc

How to use the shell to count Chinese characters in a UTF-8 encoded file


Running cat doc.txt shows the following:

你好 Hello!
这是中文。This is a Chinese doc.

I can use the command

wc -w doc.txt

but it will show:

8 doc.txt

This command treats 你好 and 这是中文 each as a single word, while in fact 你好 is two Chinese words and 这是中文 is four.

What I want is to count these Chinese words correctly (there are 12 words in the example). Could anyone help?


Solution

  • You can use the -m (or --chars) option of wc, which counts characters instead of words:

    $ echo -n "你好" | wc -m  
    

    Output:

    2
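
Note that wc -m counts every character, including spaces, punctuation and newlines, and only treats 你好 as two characters when the locale is UTF-8. To get the count the question asks for (each Chinese character as one word, plus each run of Latin letters or digits as one word), something like the following may work. It is a minimal sketch that is not part of the original answer: it assumes perl is installed, and the function names count_mixed_words and count_han_chars are hypothetical helpers made up for illustration.

```shell
# Count "words" in UTF-8 text: every Han (Chinese) character counts as one
# word, and every run of ASCII letters/digits counts as one word.
# Assumption: perl is available; -CSD makes perl decode stdin and files as UTF-8.
count_mixed_words() {
    perl -CSD -ne '
        $n++ while /\p{Han}|[A-Za-z0-9]+/g;   # one increment per match
        END { printf "%d\n", $n || 0 }        # print 0 for empty input
    ' "$@"
}

# Count only the Han (Chinese) characters, ignoring everything else.
count_han_chars() {
    perl -CSD -ne '
        $n++ while /\p{Han}/g;
        END { printf "%d\n", $n || 0 }
    ' "$@"
}
```

With the two sample lines from the question saved in doc.txt, count_mixed_words doc.txt should print 12 and count_han_chars doc.txt should print 6. Both functions also read standard input when no file is given. Punctuation such as 。 and ! is not counted, because it matches neither pattern.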