I'm looking for a way to remove diacritics and other letter marks from text and simplify it so that it is a good fit for a text search index.
For removing the diacritics, I already found these:
I was wondering about a generic, language-independent solution. (Also, this reference list might be useful for some.)
Removing the diacritics works for äöüò, etc. But I also want:
For example, I want to index the name Røyksopp, which sometimes also occurs as Röyksopp, just under the simplified name Royksopp. Similarly, KoЯn should become KoRn.
Some ICU magic:
echo "ë ö ø Я Ł ɲ æ å ñ 開 당" | uconv -x any-name | perl -wpne 's/ WITH [^}]+//g;' | uconv -x name-any | uconv -x any-latin -t iso-8859-1 -c | uconv -f iso-8859-1 -t ascii -x latin-ascii -c
yields
e o o A L n ae a n ki dang
This uses the command-line tool uconv, but the same can be done with ICU's Java, C, or C++ API, and ICU has bindings for many other languages.
Note that Я -> A is the correct behavior: Я is the Cyrillic letter YA, a vowel, not a mirrored Latin R, so transliteration gives a Latin A. What you want is not how Unicode defines that character; blame KoЯn for abusing it.
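For readers without uconv at hand, the name-stripping step of the pipeline above (any-name, drop " WITH ...", name-any) can be roughly sketched with Python's standard unicodedata module. This is only an approximation of that one step, not of the whole pipeline: it does no transliteration, so æ, Я, 開 and 당 pass through unchanged. The function name strip_marks_by_name is made up for this illustration.

```python
import re
import unicodedata

# Matches the " WITH ..." suffix of a Unicode character name, e.g.
# "LATIN SMALL LETTER O WITH STROKE" -> "LATIN SMALL LETTER O".
_WITH_SUFFIX = re.compile(r" WITH .+$")


def strip_marks_by_name(text: str) -> str:
    """Replace each character by the one whose name lacks the ' WITH ...' part.

    Hypothetical helper mirroring the any-name / perl / name-any steps.
    """
    # Compose first so that "o" + COMBINING DIAERESIS becomes a single "ö",
    # whose name then carries the " WITH DIAERESIS" suffix.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        base_name = _WITH_SUFFIX.sub("", name)
        if base_name != name:
            try:
                ch = unicodedata.lookup(base_name)
            except KeyError:
                pass  # no character with the shortened name; keep original
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(strip_marks_by_name("Røyksopp Röyksopp"))
```

Unlike plain NFKD decomposition followed by removal of combining marks, this also handles letters such as ø and Ł, which have no canonical decomposition. Mapping the remaining non-Latin characters to ASCII still needs a transliterator such as ICU's.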