
Reduce a UTF-8 string for binary comparison

I want to quickly check whether a UTF-8 word exists as an array key.

The words may have:

  • different case
  • accented characters or not
  • different Unicode normalization forms

I can use mb_strtolower() to make them both lowercase, and Normalizer::normalize() to bring both strings to the same normalization form. This covers the first and third bullet points, but does not handle accents:

'tést' !== 'test'

I can use Collator to compare both words:

$collator = new Collator('fr_FR');
$collator->compare('tést', 'test'); // 0

This covers all 3 bullet points, but now I have to loop over all my word pairs to compare them, whereas I want to perform a binary lookup as an array key (I have many lookups to perform on a big dictionary).

What I want is:

function reduce($word) {
    // how?
}

// prepare the dictionary (once)

$dictionary = [];

foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// perform a lookup (many times)

if (isset($dictionary[reduce($lookupWord)])) {
    // it's a match!
}

Basically, I want the reduce() function (which may be poorly named) to perform a simplification like this one:

  • 'TÈST' => 'test'
  • 'Straße' => 'strasse'

I believe MySQL does something like this internally for its text indexes.

Is there an intl function that does this? The list of intl classes and functions is hard to digest.


  • What I'm looking for is the Transliterator class. An example can be found in this answer:

    $string = "Fóø Bår";
    $transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
    echo $transliterator->transliterate($string); // foo bar

    Thanks to @Pete for the pointer in the comments.

    This even works with non-European characters:

    echo $transliterator->transliterate('Fóø Bår 学中文'); // foo bar xue zhong wen

    Whereas iconv fails at the job:

    echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'Fóø Bår 学中文'); // Foo Bar ???

    Unless I'm missing some other iconv options, of course.
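
For completeness, here is a minimal sketch of how the transliterator could be plugged into the reduce() function from the question. It assumes the intl extension is loaded and reuses the rule string shown above. The caching in a static variable, the error handling and the sample words are my own illustrative choices, not part of the original answer.

```php
<?php
// Sketch only: requires the intl extension. The rule string is the one from
// the answer; the caching, error handling and sample data are illustrative.

function reduce(string $word): string
{
    // Build the transliterator once and reuse it for every call.
    static $transliterator = null;

    if ($transliterator === null) {
        $transliterator = Transliterator::createFromRules(
            ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
            Transliterator::FORWARD
        );
        if ($transliterator === null) {
            throw new RuntimeException('Could not create the transliterator: ' . intl_get_error_message());
        }
    }

    $reduced = $transliterator->transliterate($word);
    if ($reduced === false) {
        throw new RuntimeException('Transliteration failed: ' . $transliterator->getErrorMessage());
    }

    return $reduced;
}

// Prepare the dictionary (once).
$dictionaryWords = ['test', 'strasse']; // hypothetical sample data
$dictionary = [];
foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// Perform lookups (many times): each one is a single array key access.
foreach (['TÈST', 'Straße', 'other'] as $lookupWord) {
    var_dump(isset($dictionary[reduce($lookupWord)]));
}
```

Since both the dictionary words and the lookup words go through the same reduce() call, the comparison itself stays a plain byte-for-byte key lookup. Note that the Any-Latin step also romanizes non-Latin scripts, so it may be worth dropping that step if words in different scripts should never collide.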