php, icu, intl

Reduce a UTF-8 string for binary comparison


I want to quickly check if a UTF-8 word exists as an array key.

The words may have:

  • different case
  • accented characters or not
  • different Unicode normalization forms

I can use mb_strtolower() to make them both lowercase, and Normalizer::normalize() to bring them to the same normalization form. This covers the first and third bullet points, but does not handle accents:

'tést' !== 'test'
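
A minimal sketch of that limitation, assuming the mbstring and intl extensions are loaded. The helper name lowerAndNormalize() is made up for illustration:

```php
<?php
// Hypothetical helper: lowercase, then compose to NFC.
function lowerAndNormalize(string $word): string
{
    return Normalizer::normalize(mb_strtolower($word, 'UTF-8'), Normalizer::FORM_C);
}

$decomposed  = "TE\u{0301}ST"; // 'E' followed by U+0301 COMBINING ACUTE ACCENT
$precomposed = "t\u{00E9}st";  // 'é' as the single code point U+00E9

// Case and normalization form no longer matter...
var_dump(lowerAndNormalize($decomposed) === lowerAndNormalize($precomposed)); // bool(true)

// ...but the accent still does.
var_dump(lowerAndNormalize($precomposed) === 'test'); // bool(false)
```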

I can use Collator to compare both words:

$collator = new Collator('fr_FR');
$collator->setStrength(Collator::PRIMARY);
$collator->compare('tést', 'test'); // 0

This covers all 3 bullet points, but now I have to loop over every word pair to compare them, whereas I want to perform a binary (byte-for-byte) lookup by array key (I have many lookups to perform on a big dictionary).
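
For illustration, the linear scan this implies would look roughly like the sketch below. It assumes the intl extension; existsInDictionary() and the sample words are hypothetical:

```php
<?php
$collator = new Collator('fr_FR');
$collator->setStrength(Collator::PRIMARY);

// Hypothetical helper: one compare() call per dictionary word,
// so every lookup costs O(n) in the size of the dictionary.
function existsInDictionary(Collator $collator, array $words, string $lookup): bool
{
    foreach ($words as $word) {
        if ($collator->compare($word, $lookup) === 0) {
            return true;
        }
    }
    return false;
}

$dictionaryWords = ['tést', 'école'];

var_dump(existsInDictionary($collator, $dictionaryWords, 'TEST'));
var_dump(existsInDictionary($collator, $dictionaryWords, 'other'));
```

With a big dictionary and many lookups, this per-word comparison is exactly the cost a precomputed array key would avoid.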

What I want is:

function reduce($word) {
    // how?
}

// prepare the dictionary (once)

$dictionary = [];

foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// perform a lookup (many times)

if (isset($dictionary[reduce($lookupWord)])) {
    // it's a match!
}

Basically, I want the reduce() function (which may be poorly named) to perform a simplification like this one:

  • 'TÈST' => 'test'
  • 'Straße' => 'strasse'

I believe MySQL does something like this internally for its text indexes.

Is there an intl function that does this? The list of intl classes and functions is hard to digest.


Solution

  • What I'm looking for is the Transliterator class. An example can be found in this answer:

    $string = "Fóø Bår";
    $transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
    echo $transliterator->transliterate($string); // foo bar
    

    Thanks to @Pete for the pointer in the comments.

    This even works with non-European characters:

    echo $transliterator->transliterate('Fóø Bår 学中文'); // foo bar xue zhong wen
    

    Where iconv would fail at the job:

    echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'Fóø Bår 学中文'); // Foo Bar ???
    

    Unless I'm missing some other iconv options, of course.
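
Putting this together with the question, reduce() could be sketched as follows. This is one possible wiring, not a definitive implementation: it assumes the intl extension, reuses the rule string from the example above, and the static cache and error handling are my own additions:

```php
<?php
function reduce(string $word): string
{
    // Build the transliterator once and reuse it across calls.
    static $transliterator = null;

    if ($transliterator === null) {
        $transliterator = Transliterator::createFromRules(
            ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
            Transliterator::FORWARD
        );
        if ($transliterator === null) {
            throw new RuntimeException('Could not create the transliterator');
        }
    }

    $reduced = $transliterator->transliterate($word);
    if ($reduced === false) {
        throw new RuntimeException('Transliteration failed');
    }

    return $reduced;
}

// Prepare the dictionary (once); sample words taken from the question.
$dictionaryWords = ['TÈST', 'Straße'];
$dictionary = [];

foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// Perform lookups (many times).
var_dump(isset($dictionary[reduce('tést')]));    // bool(true)
var_dump(isset($dictionary[reduce('STRASSE')])); // bool(true)
var_dump(isset($dictionary[reduce('other')]));   // bool(false)
```

Each lookup is then a single hash access on the reduced key instead of a pass over the whole dictionary.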