Search code examples
mysqldatabaseutf-8database-migrationiso-8859-1

Are there concerns if we have a database with mix of character set and collations?


we are currently in the process of migrating latin1 databases to UTF8 in a MySQL Server.

We are investigating what is the best approach and hopefully without any downtime.

My question is if we will migrate a table to UTF8 in a latin1 database and for example, we have a query that performs actions on these two tables (let's say join for example)

is there any chance we will start getting not expected results because of the different collations/characters set?

I guess migrating the entire database will be safer but will involve downtime.

thank you.

Edit:

We are looking at our data, we see that 99% of the data is windows cp1252 encoded data, which means it will be safe to convert the table from latin1 to utf8 without being afraid of corrupting it.

So as I see it we have three options:

  1. Downtime and convert the entire database
  2. No downtime and convert each table with a risk of performance and some errors in production (due to join queries and incompatible collations)
  3. use pt-online-schema-change (or something similar)

Solution

  • Yes and no.

    Separate columns in a table can have different charsets and collations. Separate rows cannot.

    Tables are independent until you JOIN. At that point, performance suffers if the column(s) you are Joining do not have the same charset and collation.

    When inserting/fetching data, your client has a single charset. If that is utf8mb4, then MySQL will happily convert characters on the fly between that and what is in the table columns. Of course, some characters cannot be converted; for example, latin1 has far fewer distinct characters than utf8mb4.

    You mentioned migration. You will need to either dump and reload (slower) or do lots of ALTERs (faster, but not "fast" for huge tables) or do something else. Please state the parameters you can live with. There are tradeoffs among speed, simplicity, etc.

    A lot of people get in trouble when changing charsets. There are about 7 different ALTERs that one is tempered to use. If you perform the wrong one, it makes a mess that is really hard to unravel.

    If you try to keep the database "live" while changing it, you could hit the JOIN performance problem mentioned above.

    If you already have replication set up, there may be some techniques that take advantage of a "failover".

    Keep the following Q&A at hand in case you encounter some gibberish: Trouble with UTF-8 characters; what I see is not what I stored

    What version of MySQL are you using? If it 8.0, then default charset (utf8mb4) and its default collation are probably best.