We have a MySQL database containing a table of authors. Some of the authors names have non-English characters in them (example LÜTTGE).
Our client wants users to be able to find such records even if they don't enter the non-English character. So in the above example "LUTTGE" should also find that result. At the moment it only works if the user searches for the name using the non-English character, so "LÜTTGE" works but "LUTTGE" returns nothing.
The frontend to this is a web application written in CakePHP 2
Does anyone have any ideas on how to do this as I'm at a loss? Ideally we want to be able to do this within CakePHP/MySQL, and not use third party search systems.
The above is just one example in a table of thousands of records. So it's not just a case of substituting "U" with "Ü" - there are many other variants.
This can be handled by using the MySQL collation system.
For example, the following query returns a true
(1
) value:
SELECT 'LÜTTGE' COLLATE utf8_general_ci = 'LUTTGE'
Accordingly, if you set the column's character set to utf8
and its collation to utf8_general_ci
you will get the result you mention with umlaut characters.
The default collation in MySQL reflects its Swedish origin and is utf8_swedish_ci. In Swedish, Ü and U are not the same letter. You probably have used the default collation for your columns.
The utf8_general_ci
collation handles matching 'Eßen' to 'Esen' but not to 'Essen'. It handles matching 'LÜTTGE' to 'LUTTGE' but not to 'Luettge', unfortunately.
On the other hand, the utf8_german2_ci
collation matches 'Eßen' to 'Essen' and 'LÜTTGE' to 'LUETTGE'. If your users are accustomed to using ASCII transliterations of German characters you may wish to explore your choices here. One of them is to use a query with OR
SELECT whatever
FROM table
WHERE ( namen COLLATE utf8_general_ci = 'LUTTGE'
OR namen COLLATE utf8_german2_ci = 'LUTTGE' )
It can get more complex if you need to handle Spanish, because Ñ
is considered a different letter from N
. You may need to do some explaining for your users.
Marcus suggested using the utf_unicode_ci
collation. That will handle things partially too. Here are the cases
type utf8_general_ci utf8_german2_ci utf8_unicode_ci utf8_spanish_ci
'Eßen' to 'Esen' substitute match no match no match no match
'Eßen' to 'Essen' transliterate no match match match match
'LÜTTGE' to 'LUTTGE' substitute match no match match match
'LÜTTGE' to 'LUETTGE' transliterate no match match no match no match
'Niño' to 'Nino' transliterate match match match no match
So you still need some extra work to handle transliterations.