
Creating an effective word counter including Chinese/Japanese and other accented languages


I'm trying to build an effective word counter for strings. I know about PHP's built-in str_word_count function, but unfortunately it doesn't do what I need: I have to count words in text that includes English, Chinese, Japanese and accented characters.

str_word_count fails to count such words unless you list the extra characters in its third argument, but that is impractical: I would have to add every single Chinese, Japanese and accented character, which is not what I want.

Tests:

str_word_count('The best tool'); // int(3)
str_word_count('最適なツール'); // int(0)
str_word_count('最適なツール', 0, '最ル'); // int(5)

I found this function online. It looked like it could do the job, but it also fails to count unspaced Japanese text correctly:

function word_count($str)
{
    if($str === '')
    {
        return 0;
    }

    return preg_match_all("/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u", $str);
}

Tests:

word_count('The best tool'); // int(3)
word_count('最適なツール'); // int(1)

// With spaces
word_count('最 適 な ツ ー ル'); // int(5)

Basically, I'm looking for a good UTF-8-aware word counter that can count words in any typical language, including accented characters and non-Latin scripts. Is there a possible solution to this?


Solution

  • You can take a look at the mbstring extension to work with UTF-8 strings.

    mb_split() splits a multibyte string using a regular expression pattern.

    <?php 
    printf("Counting words in: %s\n", $argv[1]);
    mb_regex_encoding('UTF-8');
    mb_internal_encoding("UTF-8");
    $r = mb_split(' ', $argv[1]); 
    print_r($r); 
    printf("Word count: %d\n", count($r));
    
    $ php mb.php "foo bar"
    Counting words in: foo bar
    Array
    (
        [0] => foo
        [1] => bar
    )
    Word count: 2
    
    
    $ php mb.php "最適な ツール"
    Counting words in: 最適な ツール
    Array
    (
        [0] => 最適な 
        [1] => ツール
    )
    Word count: 2
    

    Note: at first I had to put two spaces between the words to get a correct count. Setting mb_regex_encoding() and mb_internal_encoding() to UTF-8 fixed that.
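
    Splitting on a single ' ' treats every space as its own separator, so repeated spaces can leave empty entries in the result. Below is a minimal sketch of a more forgiving counter, assuming the input really is whitespace-delimited. It swaps in preg_split() with the /u modifier instead of mb_split(), and the function name count_space_separated_words is made up for illustration:

    ```php
    <?php
    // Hypothetical helper, not part of the original answer: counts
    // whitespace-separated tokens in a UTF-8 string.
    function count_space_separated_words(string $text): int
    {
        // \p{Z} covers Unicode separators such as the ideographic space
        // (U+3000); \s covers tabs and newlines. PREG_SPLIT_NO_EMPTY drops
        // the empty strings produced by leading or trailing whitespace.
        $tokens = preg_split('/[\p{Z}\s]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

        // preg_split() returns false on failure, e.g. for invalid UTF-8.
        return $tokens === false ? 0 : count($tokens);
    }
    ```

    This still counts an unspaced run such as 最適なツール as a single token, so it only helps when the text already contains separators.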

    However, written Chinese does not separate words with spaces (and the same is often true of Japanese), so you may never get a meaningful result this way.

    You may need to write an algorithm that uses a dictionary to determine which groups of characters form a "word".
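
    One simple way to sketch the dictionary idea is greedy longest-match segmentation: at each position, take the longest dictionary entry that matches and move past it. The function name segment_with_dictionary and the three-entry word list in the usage example are hypothetical; a real word list would be far larger, and this greedy strategy is only one of several possible approaches:

    ```php
    <?php
    // Hypothetical sketch: greedy longest-match segmentation against a
    // caller-supplied word list. Assumes UTF-8 input and the mbstring extension.
    function segment_with_dictionary(string $text, array $dictionary): array
    {
        // Index the entries for fast lookup and find the longest one.
        $known = array_fill_keys($dictionary, true);
        $maxLen = 1;
        foreach ($dictionary as $entry) {
            $maxLen = max($maxLen, mb_strlen($entry, 'UTF-8'));
        }

        $words = [];
        $length = mb_strlen($text, 'UTF-8');
        $pos = 0;
        while ($pos < $length) {
            $matched = null;
            // Try the longest candidate first, then shrink it.
            for ($size = min($maxLen, $length - $pos); $size >= 1; $size--) {
                $candidate = mb_substr($text, $pos, $size, 'UTF-8');
                if (isset($known[$candidate])) {
                    $matched = $candidate;
                    break;
                }
            }
            if ($matched === null) {
                // No entry matches here: fall back to a single character.
                $matched = mb_substr($text, $pos, 1, 'UTF-8');
            }
            $pos += mb_strlen($matched, 'UTF-8');

            // Whitespace is a separator, not a word.
            if (preg_match('/^[\p{Z}\s]+$/u', $matched) !== 1) {
                $words[] = $matched;
            }
        }
        return $words;
    }

    // Usage with a made-up word list:
    $words = segment_with_dictionary('最適なツール', ['最適', 'な', 'ツール']);
    printf("Word count: %d\n", count($words)); // Word count: 3
    ```

    Characters that are not covered by the word list fall back to one "word" each, so the result is only as good as the dictionary. This sketch is meant for unspaced scripts; Latin text would still need to be split on whitespace first.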