Search code examples
phpmysqlmysqliencodingheidisql

How to clean up encoding in MySQL table


The system is a PHP application accessing a MySQL database. The first tables were created with the standard latin1 encoding and filled via mysqli in PHP, without setting any encoding. The PHP scripts and data all work in UTF-8.

The newer tables have set the encoding to utf8_bin and in front of every transaction SET CHARACTER SET utf8 is sent.

If I look at the newer tables in the mysql database (via a sql explorer like HeidiSQL) every special character is displayed correctly. However, in every older table the typical latin1-utf8 errors are visible like Ãœ for ü.

Is there an easy way to fix this like in one of the following ways:

  • Fix the encoding for each table so that is displayed correctly in the sql explorer, but keep the PHP code as it is (with SET CHARACTER SET utf8 or latin1 in front of every statement, fitting the tables encoding. (merely a workaround)
  • Switch the encoding to utf8 for all tables -> now SET CHARACTER SET utf8 has to be sent at the beginning of every mysqli connection (or maybe there is a way to set this as standard?)
  • Switch the encodung to latin1 for all tables -> no need to send SET CHARACTER SET utf8 in front of transactions anymore, but wrong encoding in the database explorer.

It seems as if the Database takes all tables as utf8 and shows latin1 tables therefore with wrong characters. Mysqli takes all tables as latin1 if not told different.

The application is productive, the encoding problem is not visible to the user, as the right encoding is told to mysqli in front of every statement. But I feel like that is not a good practice.

I recognize that there is something wrong with how the database is set up, and I hope to learn what is the best practice to fix this.


Solution

  • The first tables were created with the standard latin1 encoding and filled via mysqli in PHP

    So, those tables have garbage. Do you need to recover the data?

    in front of every transaction SET CHARACTER SET utf8 is sent.

    That's useless. Instead, do SET NAMES utf8 once after connecting.

    Ãœ is Mojibake for Ü. Search this forum for Mojibake.

    Having some tables latin1 with utf8 bytes, while other tables are utf8 -- that will be a painful mess going forward. Stop now and clean up the mess. Go utf8 all the way.

    Summary of Mojibake:

    • The bytes you have in the client are correctly encoded in utf8 (good).
    • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
    • xx The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
    • The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.

    If you need to fix for the data it takes a "2-step ALTER", something like

    ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
    ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;