Search code examples
mysqlutf-8iso-8859-1

Detailed instructions on converting a MYSQL DB and its data from latin to UTF-8. Too much diff info out there


Can you someone please provide the best way to convert not only a mysql database and all its tables from latin1_swedish_ci to UTF-8, with their contents? I have been researching all over Stackoverflow as well as elsewhere and the suggestions are always different.

  • Some people suggest just using these commands on the tables and databases:

    ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
    ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
    
  • Others say that this just changes the database and tables, but not the contents.

  • Some suggest dumping the db, create a new table with the right char set and collation, and importing the old db into that. Does this actually convert the data as well?

    mysqldump --skip-opt --set-charset --skip-set-charset
    
  • Others suggest running iconv against the dumped DB before importing? Is this really needed or would the import into a UTF-8 db do the conversion?

Finally, other suggest altering the database, convert char/blog tables to binary, and the converting back.

There are so many different methods that it has become very confusing.

Can someone please provide a concise step-by-step instruction, or point me to one, on how I can go about convert my latin DBs and their content to UTF-8? Even better if there is a script that automates this process against a database.

Thanks in advance.


Solution

  • The are two different problems which are often conflated:

    1. change the specification of a table or column on how it should store data internally
    2. convert garbled mojibake data to its intended characters

    Each text column in MySQL has an associated charset attribute, which specifies what encoding text stored in this column should be stored as internally. This only really influences what characters can be stored in this column and how efficient the data storage is. For example, if you're storing a ton of Japanese text, sjis as an encoding may be a lot more efficient than utf8 and save you a bit of disk space.

    The column encoding does not in any way influence in what encoding data is input and output to/from the database. This is a separate setting, the connection encoding, which is established for every individual client every time you connect to the database. MySQL will convert data on the fly between the connection encoding and the column/table charset as needed. You can connect to the database with a utf8 connection, send it Japanese text destined for an sjis column, and MySQL will convert from utf8 to sjis on the fly (and back in reverse on the way out).

    Now, if you've screwed up the connection encoding (as happens way too often) and you've inserted text in a different encoding than your connection encoding specified (e.g. your connection encoding was latin1 but you actually sent UTF-8 encoded data), then you're storing garbage in your database and you need to recover that. If that's your issue, see How to convert wrongly encoded data to UTF-8?.

    However, if all your data is peachy and all you want to do is tell MySQL to store data in a different encoding from now on, you only need this:

    ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
    

    MySQL will convert the current data from its current charset to the new charset and store future data in the new charset. That's all.