
How to make MediaWiki search ignore accents?


I'm running a MediaWiki instance that I just upgraded to the latest version at the time of this writing, 1.32.0. This wiki is nearly 10 years old and has gone through a number of upgrades.

It's a French-language wiki, and a long-standing annoyance for French speakers is that the built-in search has always treated accented characters as different from their unaccented counterparts, version after version.

For example, searching for Aromathérapie returns a number of results, while searching for Aromatherapie returns 0 results.

I thought that this was a database collation issue at first, until I noticed that the searchindex table is actually populated with words whose UTF-8 bytes have been encoded as ASCII. Taking the example above, aromathérapie is stored as aromathu8c3a9rapie (é is the UTF-8 byte sequence C3 A9), so changing the table collation does not help.

Digging through the source code, I found the SearchMySQL::normalizeText() method that is responsible for this encoding.

And as far as I can see, the only normalization that this method does prior to encoding is lowercasing:

MediaWikiServices::getInstance()->getContentLanguage()->lc( $out )

So as it stands, it looks like there is no way to make the built-in search ignore accents.
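
As a rough model of the behavior described above, here is a Python sketch of the encoding. It is inferred only from the aromathu8c3a9rapie example and the lowercasing call, it is not MediaWiki's PHP code, and encode_for_index is a hypothetical name.

```python
import re


def encode_for_index(text: str) -> str:
    """Lowercase, then replace each non-ASCII character with
    'u8' followed by the hex of its UTF-8 bytes (assumed behavior)."""
    lowered = text.lower()
    return re.sub(
        r"[^\x00-\x7f]",
        lambda m: "u8" + m.group(0).encode("utf-8").hex(),
        lowered,
    )


# "\u00e9" is the precomposed character é.
accented = encode_for_index("Aromath\u00e9rapie")
plain = encode_for_index("Aromatherapie")

print(accented)
print(plain)
print(accented == plain)
```

Because the accented letter becomes a distinct ASCII token before it ever reaches the index, the two spellings can never match, whatever collation the table uses.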

I googled quite a lot for solutions and found mostly old, irrelevant threads. I'm really surprised not to find more literature on the subject.

How can I make the MediaWiki search both case- AND accent-insensitive?


Solution

  • I'm not proud of it, but here's how I solved it, using MySQL's built-in support for collations (which does work with fulltext indexes, at least in recent versions of MySQL, contrary to what the code says):

    • Converted the searchindex table to utf8mb4:
      ALTER TABLE searchindex CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    • Applied this patch to includes/search/SearchMySQL.php:
      • no lowercasing, no replacing of UTF-8 chars with their hex-encoded counterpart
      • unicode u flag in preg_replace()
    • Rebuilt the searchindex table: php maintenance/rebuildtextindex.php

    A similar procedure will have to be applied whenever the MediaWiki installation is updated, which adds to the maintenance cost. The procedure being simple, it's a cost I'm willing to accept right now.

    A final note is that this does not make the autocompletion work case-insensitively, only the search results. This is good enough for me for now.
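
To illustrate the kind of matching a case- and accent-insensitive collation such as utf8mb4_unicode_ci provides once the raw UTF-8 text is stored, here is a minimal Python sketch. It only approximates the effect by stripping combining marks after Unicode decomposition; it is not how MySQL implements the collation, and fold is a hypothetical helper name.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate a case- and accent-insensitive key:
    decompose to NFD, drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.casefold()


# "\u00e9" is the precomposed character é.
indexed_word = "Aromath\u00e9rapie"
query = "aromatherapie"

print(fold(indexed_word))
print(fold(indexed_word) == fold(query))
```

With a comparison like this, the unaccented, lowercase query matches the accented word, which is the behavior the collation change is meant to give the search results.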