php mysql drupal utf-8

UTF8 characters not printed as such in Drupal's HTML


I am trying to debug a nasty utf-8 problem, and do not know where to start.

A page contains the word 'categorieÃ«n', which should be 'categorieën'. Clearly something is wrong with the UTF-8. This happens with all multibyte characters. I have scanned the gazillion topics here on UTF-8, but they mostly cover the basics, not this situation where everything appears to be configured and set correctly, but clearly is not.

The pages are served by Drupal, from a MySQL database.

The database was migrated (not by me) by SQL dumping and importing through phpMyAdmin. There is a good chance something went wrong there, because there was no problem before, and because the problem occurs only on older, imported items. Editing these items or inserting new ones, and fixing the wrongly encoded characters by hand, fixes the problem, though I cannot see a difference in the database.

  • Content re-edited through Drupal does not have this problem.
  • On the MySQL CLI, I can read out that text and get the correct ë character, both for the articles that render "correct" characters and for those that render "incorrect" ones.
  • The tables have collation utf8_general_ci
  • Headers appear to be sent with the correct encoding: Vary: Accept-Encoding and Content-Type: text/html; charset=utf-8
  • HTML head contains a <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  • The HTTP headers tell me there is a Varnish proxy in between. Could that cause UTF-8 conversion or breakage?
  • Content is served gzipped, which is normal in Drupal. I have never seen this UTF-8 issue in relation to gzipping, but you never know.

It appears the import is the culprit, and I would like to know a) what went wrong, b) why I cannot see a difference in the MySQL CLI client between "wrong" and "correct" characters, and c) how to fix the database, or where to start looking and learning how to fix it.


Solution

  • The dump file was probably output as UTF-8, but interpreted as latin1 during import.

    The Ã«, the latin1 interpretation of the two bytes that make up UTF-8's ë, is now physically in your tables as UTF-8 data.

    Seeing as you have a mix of intact and broken data, this will be tough to fix in a general way, but usually, this dirty workaround* will work well:

    UPDATE table SET column = REPLACE(column, 'Ã«', 'ë');
    

    Unless you are working with languages other than Dutch, the range of broken characters should be extremely limited, and you might be able to fix it with a small number of such statements.


    * (of course, don't forget to make backups before running anything like this!)
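To make the suspected mechanism concrete, here is a minimal Python sketch of what "output as UTF-8, interpreted as latin1" does to a string, and how one round of it can be undone. The function names `mangle` and `repair` are made up for illustration. Python's `latin-1` codec is used as an approximation of MySQL's latin1, which is closer to cp1252 and differs for bytes 0x80 to 0x9F, so treat this as a way to understand the damage rather than as a drop-in repair tool.

```python
def mangle(text: str) -> str:
    """Simulate the suspected import error: UTF-8 bytes read as latin1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo one round of 'UTF-8 read as latin1'.

    Text that does not survive the reverse round trip is returned unchanged,
    so already-correct values such as 'categorieën' are left alone.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    broken = mangle("categorieën")
    print(broken)                 # categorieÃ«n
    print(repair(broken))         # categorieën
    print(repair("categorieën"))  # categorieën (intact text is untouched)
```

Because `repair` leaves intact text alone, the same idea copes with a mix of broken and correct rows, which is the situation described in the question. Applying it to real data would still mean reading each value out of the database and writing it back, on a backup first.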