
How to count words in a mixture of English and Chinese in JavaScript


I want to count the number of words in a passage that contains both English and Chinese. For English, it's simple: each word counts as one. For Chinese, we count each character as a word, so 香港人 is three words here.

So for example, "I am a 香港人" should have a word count of 6.

Any idea how I can count this in JavaScript/jQuery?

Thanks!


Solution

  • Try a regex like this:

    /[\u00ff-\uffff]|\S+/g
    

    For example, "I am a 香港人".match(/[\u00ff-\uffff]|\S+/g) gives:

    ["I", "am", "a", "香", "港", "人"]
    

    Then you can just check the length of the resulting array.

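    The \u00ff-\uffff part of the regex is a Unicode character range. It is broad and also covers non-Chinese scripts such as Greek and Cyrillic, so you probably want to narrow it down to just the characters you want to count as words. For example, CJK Unified Ideographs would be \u4e00-\u9fcc (the full block runs to \u9fff).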

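    function countWords(str) {
        // Each match is either one character in the range or a run of non-whitespace.
        var matches = str.match(/[\u00ff-\uffff]|\S+/g);
        // match() returns null when nothing matches, so fall back to 0.
        return matches ? matches.length : 0;
    }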
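
One thing to watch with the pattern above: \S+ is greedy, so when Chinese characters directly follow Latin letters with no space in between (as in "Hello香港人"), the whole run is matched as a single word. A possible variant, sketched here under the assumption that your environment supports Unicode property escapes (the `u` flag with `\p{Script=Han}`, available since ES2018), narrows the range to Han characters and keeps them out of the "other word" alternative. The name countMixedWords is hypothetical and not part of the original answer.

```javascript
// Hypothetical variant of countWords; not from the original answer.
// Assumes an engine with Unicode property escapes (ES2018+).
function countMixedWords(str) {
    // First alternative: exactly one Han character, counted as one word.
    // Second alternative: a run of characters that are neither whitespace
    // nor Han, so "Hello香港" splits into "Hello", "香", "港".
    var matches = str.match(/\p{Script=Han}|[^\s\p{Script=Han}]+/gu);
    // match() returns null when there are no matches at all.
    return matches ? matches.length : 0;
}

console.log(countMixedWords("I am a 香港人"));
console.log(countMixedWords("Hello香港人"));
```

Punctuation is still not treated specially here: a comma attached to an English word is counted with that word, and a standalone mark such as 。 counts as its own token, so strip punctuation first if that matters for your count.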