mysql utf-8 non-ascii-characters turkish utf8mb4

How to find all variations (accented, etc) of a searched string in MySQL table?

I have a big MySQL, InnoDB table which holds data in utf8mb4_unicode_ci. I have records from almost every languages those are based on latin alphabet. I am easily able to search words with accented characters and find results with different variations (accented/not-accented), but there is a problem with the Turkish language. A basic LIKE keyword does not work in the SQL line when you specify a word that includes a big I letter. I expect to see all words those include the lower case version of that letter too (which is dotless i).

So, Iis i in English, but it is dotless i in Turkish. (ı)

SELECT * FROM words WHERE word LIKE 'SIR' does not bring the word sır word. It only brings sir word but they both exist in the same table.

Have you ever had any problem like this? How to solve it with SQL?

update: SELECT * FROM words WHERE word LIKE 'SIR' COLLATE utf8mb4_turkish_ci returns sır but not sir. Even though this could solve some of the problems, unfortunately, it takes 23 seconds to return from this query instead of 0.00003 seconds (without COLLATE keyword)

Solution

Index or scan?

In your example, I assume word was indexed? And the table is rather large? In that case, an index is very powerful for finding the row(s) that match for the specified collation. Any other collation (your case with COLLATE ...) requires ignoring the index and scanning the entire table (slow).

Using REGEXP will also require a full table scan.
Doing REPLACE(...) on the fly will require a full table scan.

So, given that there are "too many" rows to allow for a full table scan, there is no simple way to search for sır or sir for arbitrary collations. Let's look for another way...

First, let's look at some collation cases.

In most utf8 collations, including utf8_unicode_ci, I=i=Ì=Í=Í=Î=Ï=ì=í=í=î=ï=Ī=ī=Į=į=İ
In utf8_turkish_ci, I=ı, but they come before i=Ì=Í=Í=Î=Ï=ì=í=í=î=ï=Ī=ī=Į=į=İ (and hence, explains part of your problem).
Most other utf8 collations treat ı as coming between i and j.
utf8_general_ci is slightly different: I=i=Ì=Í=Í=Î=Ï=ì=í=í=î=ï=Ī=ī=Į=į=İ=ı
utf8_icelandic_ci treats some accented i's as a different letter: I=i=Ì=Î=Ï=ì=î=ï=Ī=ī=Į=į=İ, but Í=Í=í=í fall between I and J. That one also sorts these after Z: Ä=Æ=ä=æ Ö=Ø=ö=ø Å=å

See my Reference

Another problem: Accented letters can be presented as two utf8 characters - the letter, then a "non-spacing mark"

So, what to do??

Plan A: Build your own collation. This takes some work, but is not impossible. This would be optimal, giving you the best performance. As we will see below, utf8_unicode_520_ci comes very close, and could be a starting point.

Plan B: Add another column in your table; let's call it wordx. It would have word, but will tall accents stripped. Then index wordx instead of word. So the row with word='sır' would have wordx='sir'.

Plan C: Using the equivalences discussed above, pick one type if "I" from each group for the collation defined in the table, then use utf8_unicode_520_ci with UNION:

( SELECT ... WHERE word = 'sir' )
UNION ALL
( SELECT ... WHERE word = 'sır' )

This should catch all flavors of i.

Oops, what about words with multiple different accentable letters? At first glance, maybe it is not a problem -- all a collate equal in utf8_unicode_520_ci. Glancing through the rest of the letters, I don't see anything other than ı giving a problem.

Here are the interesting equivalences for utf8_unicode_520_ci:

A=a=ª=À=Á=Á=Â=Ã=Ä=Å=à=á=á=â=ã=ä=å=Ā=ā=Ą=ą  Aa  ae=Æ=æ  az  B=b  C=c=Ç=ç=Č=č  ch  cz
D=d=Ð=ð=Ď=ď  dz  E=e=È=É=É=Ê=Ë=è=é=é=ê=ë=Ē=ē=Ĕ=ĕ=Ė=ė=Ę=ę=Ě=ě  F=f  fz  ƒ
G=g=Ğ=ğ=Ģ=ģ  gz  H=h  hz  I=i=Ì=Í=Í=Î=Ï=ì=í=í=î=ï=Ī=ī=Į=į=İ  ij=ĳ  iz  ı  J=j
K=k=Ķ=ķ  L=l=Ĺ=ĺ=Ļ=ļ=Ł=ł  lj=Ǉ=ǈ=ǉ  ll  lz  M=m  N=n=Ñ=ñ=Ń=ń=Ņ=ņ=Ň=ň  nz
O=o=º=Ò=Ó=Ó=Ô=Õ=Ö=Ø=ò=ó=ó=ô=õ=ö=ø  oe=Œ=œ  oz  P=p  Q=q  R=r=Ř=ř  S=s=Ş=ş=Š=Š=š=š
sh  ss=ß  sz  T=t=Ť=ť  TM=tm=™  tz  U=u=Ù=Ú=Ú=Û=Ü=ù=ú=ú=û=ü=Ū=ū=Ů=ů=Ų=ų  ue  uz  V=v
W=w  X=x  Y=y=Ý=Ý=ý=ý=ÿ=Ÿ  yz  Z=z=Ž=Ž=ž=ž  zh  zz  Þ=þ  µ

If you consider ƒ to be an accented letter, then you might want to deal with it, too.

Your question is interesting.