language-agnostic search full-text-search highlighting diacritics

How to write a php search script in which words with diacritics match search terms without diacritics, and the results are underlined?

I've got this site where there are lots of texts with diacritics in them (ancillary glyphs added to letters, according to wikipedia) and most people search these texts using words without the glyphs. Now it shouldn't be challenging to do this by having a copy of the texts without diacritics. However, I want to highlight the matches in the original text. What's the best way to do it?

Solution

You should try changing the collation setting in your MySQL DB.

There are three that seem to come up often in discussions of this topic:

utf8_general_ci
utf8_unicode_ci
utf8_bin ← You probably want this one.

I have found that #3 will match accents in search. This answer gives a bit of background on the differences, but it doesn't mention the fact that utf8_bin is also sensitive to accents. You might want to try all three so you can test for yourself if it's working with the language/script you're dealing with.

To be really sure that things are going to match correctly, you will have to look into Unicode Normalization as well, which is really a whole different ball of wax. It is possible that your user could type in a query with an accent in a different normalization from the one your data is stored in, and thus it might fail to match. I've had that problem with Sqlite, not sure if it applies to MySQL or not.

FWIW, here's a CREATE TABLE I'm currently using, where I needed to match accents, that is setting the COLLATION:

CREATE TABLE `glosses` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `morphemes` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  `labels` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  `phrase_id` int(11) DEFAULT NULL,
  `nth_word` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

You can see the COLLATE=utf8_bin tacked on at the end.