I am trying to debug a nasty UTF-8 problem, and do not know where to start.
A page contains the word 'categorieÃ«n', which should be 'categorieën'. Clearly something is wrong with the UTF-8 handling, and it happens with all multibyte characters. I have scanned the gazillion topics here on UTF-8, but they mostly cover the basics, not this situation where everything appears to be configured and set correctly, but clearly is not.
The pages are served by Drupal, from a MySQL database.
The database was migrated (not by me) by dumping to SQL and importing through phpMyAdmin. There is a good chance something went wrong there, because there was no problem before the migration, and because the problem occurs only on older, imported items. Editing these items and fixing the wrongly encoded characters by hand, or inserting new items, fixes the problem, though I cannot see a difference in the database.
Collation: utf8_general_ci
HTTP response headers: Vary: Accept-Encoding and Content-Type: text/html; charset=utf-8
HTML head: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
It appears the import is the culprit, and I would like to know: a) what went wrong; b) why I cannot see a difference in the MySQL CLI client between "wrong" and "correct" characters; c) how to fix the database, or where to start looking and learning how to fix it.
The dump file was probably output as UTF-8, but interpreted as latin1 during import.
The Ã«, which is how the two UTF-8 bytes of ë (0xC3 0xAB) appear when read as latin1, is now physically stored in your tables as UTF-8 data.
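To make the mechanism concrete, here is a minimal Python sketch of what most likely happened to a single character during the import. It assumes Python's "latin-1" codec as a stand-in for MySQL's latin1; MySQL's latin1 is actually closer to Windows-1252, which differs for bytes 0x80 to 0x9F, so characters whose second UTF-8 byte falls in that range may show up differently than this sketch suggests.

```python
# The correct character and its UTF-8 byte sequence.
original = "ë"
utf8_bytes = original.encode("utf-8")    # two bytes: 0xC3 0xAB

# Misreading those two bytes as latin1 yields two separate characters.
misread = utf8_bytes.decode("latin-1")   # 'Ã' followed by '«'

# Storing the misread text as UTF-8 doubles the byte count on disk.
stored = misread.encode("utf-8")         # four bytes: 0xC3 0x83 0xC2 0xAB

# Undoing the two steps in reverse order recovers the original character.
repaired = stored.decode("utf-8").encode("latin-1").decode("utf-8")

print(repr(misread), stored.hex(), repr(repaired))
```

The stored bytes are valid UTF-8, which is why every layer that merely checks or declares the encoding reports that everything is fine.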
Seeing as you have a mix of intact and broken data, this will be tough to fix in a general way, but usually, this dirty workaround* will work well:
UPDATE table SET column = REPLACE(column, "Ã«", "ë");
Unless you are working with languages other than Dutch, the range of broken characters should be extremely limited, and you might be able to fix it with a small number of such statements.
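As a rough sketch of how that small set of statements could be generated rather than typed by hand, the Python below derives the broken form of each character and prints one UPDATE per character. The names `my_table`, `my_column`, `mojibake` and `build_updates` are hypothetical, the character list is only an illustrative guess at what Dutch text needs, and the "latin-1" codec is again an approximation of MySQL's latin1. MySQL's REPLACE takes its arguments as (string, from, to).

```python
def mojibake(ch: str) -> str:
    """Return how a character looks after its UTF-8 bytes are read as latin1."""
    return ch.encode("utf-8").decode("latin-1")


def build_updates(table: str, column: str, chars: str) -> list[str]:
    """Build one UPDATE statement per character in `chars`."""
    statements = []
    for ch in chars:
        broken = mojibake(ch)
        statements.append(
            f"UPDATE {table} SET {column} = "
            f"REPLACE({column}, '{broken}', '{ch}');"
        )
    return statements


# Hypothetical table and column names; adjust the character list as needed.
for statement in build_updates("my_table", "my_column", "ëéèïöü"):
    print(statement)
```

Review the generated statements before running them, since a character outside the list will simply stay broken.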
* (of course, don't forget to make backups before running anything like this!)
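Because intact and broken values are mixed, another option is to repair values outside the database and leave anything that does not look broken untouched. The following is only a sketch under that assumption: `repair_if_broken` is a hypothetical helper, and it relies on the observation that correctly stored text containing accented characters usually cannot survive the latin1-to-UTF-8 round trip, so the attempt fails and the value is returned unchanged.

```python
def repair_if_broken(text: str) -> str:
    """Undo a UTF-8-read-as-latin1 mix-up, or return the text unchanged."""
    try:
        # Broken text round-trips cleanly; intact accented text raises here.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return text


print(repair_if_broken("categorieÃ«n"))  # broken value, gets repaired
print(repair_if_broken("categorieën"))   # intact value, left alone
```

A value that contains both intact and broken characters fails the round trip and is left as it is, so such rows would still need the per-character replacement.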