I want to quickly check whether a UTF-8 word exists as an array key.
The words may have:
I can use mb_strtolower() to make them both lowercase, and Normalizer::normalize() to normalize the strings. This checks the first 2 bullet points, but does not handle accents:
'tést' !== 'test'
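To make that limitation concrete, here is a minimal sketch, assuming the mbstring and intl extensions are loaded. The helper name lowerAndNormalize() is made up for illustration; it only combines the two calls mentioned above.

```php
<?php
// Requires the mbstring and intl extensions.

// Hypothetical helper: lowercases and normalizes to NFC, nothing more.
function lowerAndNormalize(string $word): string
{
    return Normalizer::normalize(mb_strtolower($word, 'UTF-8'), Normalizer::FORM_C);
}

// "É" as a single code point vs. "E" followed by a combining acute accent.
$precomposed = "T\u{00C9}ST";
$decomposed  = "TE\u{0301}ST";

// Case and normalization differences disappear...
var_dump(lowerAndNormalize($precomposed) === lowerAndNormalize($decomposed)); // bool(true)
// ...but the accent is still there.
var_dump(lowerAndNormalize($precomposed) === 'test');                         // bool(false)
```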
I can use Collator to compare both words:
$collator = new Collator('fr_FR');
$collator->setStrength(Collator::PRIMARY);
$collator->compare('tést', 'test'); // 0
This checks my 3 bullet points, but now I have to loop over all my word pairs to compare them, whereas I want to be able to perform a direct lookup by array key (I have many lookups to perform on a big dictionary).
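For reference, the loop I am trying to avoid looks roughly like the sketch below. The function name existsInDictionary() and the sample words are made up for illustration; the point is that every lookup costs one Collator comparison per dictionary word.

```php
<?php
// Requires the intl extension.

// Hypothetical helper: linear scan over the whole dictionary.
function existsInDictionary(Collator $collator, array $dictionaryWords, string $lookupWord): bool
{
    foreach ($dictionaryWords as $dictionaryWord) {
        // compare() returns 0 when the two words are equal at the configured strength.
        if ($collator->compare($dictionaryWord, $lookupWord) === 0) {
            return true;
        }
    }
    return false;
}

$collator = new Collator('fr_FR');
$collator->setStrength(Collator::PRIMARY);

var_dump(existsInDictionary($collator, ['Tést', 'chaise'], 'test')); // bool(true)
```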
What I want is:
function reduce($word) {
    // how?
}

// prepare the dictionary (once)
$dictionary = [];
foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// perform a lookup (many times)
if (isset($dictionary[reduce($lookupWord)])) {
    // it's a match!
}
Basically, I want the reduce() function (which may be poorly named) to perform a simplification like this one:
I believe MySQL does something like this internally for its text indexes.
Is there an intl function that does this? The list of intl classes and functions is hard to digest.
What I'm looking for is the Transliterator class. An example can be found in this answer:
$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $transliterator->transliterate($string); // foo bar
Thanks to @Pete for the pointer in the comments.
This even works with non-European characters:
echo $transliterator->transliterate('Fóø Bår 学中文'); // foo bar xue zhong wen
Where iconv would fail at the job:
echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'Fóø Bår 学中文'); // Foo Bar ???
Unless I'm missing some other iconv options, of course.
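Tying this back to the reduce() function from the question, a minimal sketch could look like the following. It assumes the intl extension is loaded and reuses the rule string from the example above; caching the transliterator in a static variable and the sample $dictionaryWords are my own additions for illustration.

```php
<?php
// Requires the intl extension.

function reduce(string $word): string
{
    // Build the transliterator once and reuse it across calls.
    static $transliterator = null;
    if ($transliterator === null) {
        $transliterator = Transliterator::createFromRules(
            ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
            Transliterator::FORWARD
        );
    }
    return $transliterator->transliterate($word);
}

// Prepare the dictionary (once). Sample data, for illustration only.
$dictionaryWords = ['Tést', 'Fóø Bår'];
$dictionary = [];
foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// Perform a lookup (many times).
var_dump(isset($dictionary[reduce('TEST')]));    // bool(true)
var_dump(isset($dictionary[reduce('foo bar')])); // bool(true)
var_dump(isset($dictionary[reduce('chaise')]));  // bool(false)
```

Each lookup is then a single array key check, at the cost of one transliteration per word.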