Search code examples
mysqlsearchcakephp-2.0search-enginecollation

Get same MySQL search results using foreign and English characters


We have a MySQL database containing a table of authors. Some of the authors names have non-English characters in them (example LÜTTGE).

Our client wants users to be able to find such records even if they don't enter the non-English character. So in the above example "LUTTGE" should also find that result. At the moment it only works if the user searches for the name using the non-English character, so "LÜTTGE" works but "LUTTGE" returns nothing.

The frontend to this is a web application written in CakePHP 2

Does anyone have any ideas on how to do this as I'm at a loss? Ideally we want to be able to do this within CakePHP/MySQL, and not use third party search systems.

The above is just one example in a table of thousands of records. So it's not just a case of substituting "U" with "Ü" - there are many other variants.


Solution

  • This can be handled by using the MySQL collation system.

    For example, the following query returns a true (1) value:

    SELECT 'LÜTTGE' COLLATE utf8_general_ci = 'LUTTGE'
    

    Accordingly, if you set the column's character set to utf8 and its collation to utf8_general_ci you will get the result you mention with umlaut characters.

    The default collation in MySQL reflects its Swedish origin and is utf8_swedish_ci. In Swedish, Ü and U are not the same letter. You probably have used the default collation for your columns.

    The utf8_general_ci collation handles matching 'Eßen' to 'Esen' but not to 'Essen'. It handles matching 'LÜTTGE' to 'LUTTGE' but not to 'Luettge', unfortunately.

    On the other hand, the utf8_german2_ci collation matches 'Eßen' to 'Essen' and 'LÜTTGE' to 'LUETTGE'. If your users are accustomed to using ASCII transliterations of German characters you may wish to explore your choices here. One of them is to use a query with OR

     SELECT whatever 
       FROM table
      WHERE (    namen COLLATE utf8_general_ci = 'LUTTGE'
              OR namen COLLATE utf8_german2_ci = 'LUTTGE'  )
    

    It can get more complex if you need to handle Spanish, because Ñ is considered a different letter from N. You may need to do some explaining for your users.

    Marcus suggested using the utf_unicode_ci collation. That will handle things partially too. Here are the cases

                           type        utf8_general_ci  utf8_german2_ci   utf8_unicode_ci utf8_spanish_ci
    'Eßen' to 'Esen'       substitute      match            no match        no match         no match
    'Eßen' to 'Essen'      transliterate   no match         match           match            match
    'LÜTTGE' to 'LUTTGE'   substitute      match            no match        match            match
    'LÜTTGE' to 'LUETTGE'  transliterate   no match         match           no match         no match
    'Niño'   to 'Nino'     transliterate   match            match           match            no match
    

    So you still need some extra work to handle transliterations.