php mysql pdo

Can get emojis to show in PHP with MySQL 5.x, but not 8.x


I have a PHP 7.3 project that's currently using MySQL 5.5, with utf8 tables. Some of the tables contain emoji data, which show up fine in the current project. I'm trying to update the project to MySQL 8.x, but when I do, emoji data shows up incorrectly.

First, I updated all the 5.5 tables to use utf8mb4. In this state, the data showed up. I then updated to 5.7, and things continued to work. I dumped this data, updated to 8.0, and reloaded it (using the --default-character-set=utf8mb4 flag on both dump and load). After that, the data stopped showing up correctly; for example, a lightbulb (💡) shows up as ðŸ’¡.

I am running each of these services in docker. I was able to update from 5.5 to 5.7 using the same data volume without issue, but when trying to upgrade from 5.7 to 8.0, I got errors I was unable to resolve, and ended up doing a data dump/restore.

An example table with a field with an emoji:

DROP TABLE IF EXISTS `forums`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `forums` (
  `forumID` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
  `description` text COLLATE utf8_unicode_ci,
  `forumType` varchar(1) COLLATE utf8_unicode_ci DEFAULT 'f',
  `parentID` int(11) DEFAULT NULL,
  `heritage` varchar(25) COLLATE utf8_unicode_ci NOT NULL,
  `order` int(5) NOT NULL,
  `gameID` int(11) DEFAULT NULL,
  `threadCount` int(11) NOT NULL,
  PRIMARY KEY (`forumID`),
  UNIQUE KEY `heritage` (`heritage`),
  KEY `parentID` (`parentID`)
) ENGINE=MyISAM AUTO_INCREMENT=11551 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `forums`
--

LOCK TABLES `forums` WRITE;
/*!40000 ALTER TABLE `forums` DISABLE KEYS */;
INSERT INTO `forums` VALUES (8003,'💡 Gamers\' Plane development',NULL,'f',2,'0002-8003',3180,3181,4);
/*!40000 ALTER TABLE `forums` ENABLE KEYS */;
UNLOCK TABLES;

To update the table to utf8mb4 I did

ALTER TABLE forums CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

How I'm testing the return of that data:

<?php
$mysql = new PDO("mysql:host=mysql;dbname=gamersplane", 'gamersplane', 'mypass');
$mysql->setAttribute(PDO::ATTR_DEFAULT_FETCH_MODE, PDO::FETCH_ASSOC);
$mysql->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$forum = $mysql->query('select * from forums where forumID = 8003')->fetch();
?>
<html>
<header>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</header>
<body>
<?php print_r($forum['title']); ?>
</body>
</html>

I've read "UTF-8 all the way through" and

  • am using utf8mb4 in the database
  • have charset=utf8mb4 in my PDO string
  • have default_charset explicitly set in my php.ini, and have also tried setting it at runtime
  • have tried Content-Type: text/html; charset=utf-8 as a PHP header, as well as an HTML meta tag
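
For reference, the test script above does not show the charset in its DSN. Here is a minimal sketch of a connection that does declare it, assuming the same host, database name and attributes as that script. The helper names buildDsn and connect are hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: builds a MySQL DSN that declares the connection charset.
function buildDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical helper: opens the connection with the same attributes
// the test script sets via setAttribute().
function connect(string $user, string $pass): PDO
{
    return new PDO(buildDsn('mysql', 'gamersplane'), $user, $pass, [
        PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}
```

With such a connection, running SELECT @@character_set_client, @@character_set_results through it should report utf8mb4 for both.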

Solution

  • Encoding in databases is always a lot of fun! Unfortunately, changing the character set does not repair data that was stored with the wrong encoding; it only changes how the database interprets the stored bytes. MySQL also does not detect which encoding the client actually sent: it treats the incoming bytes according to the connection character set. From the example you can see that ðŸ’¡ is the latin1 representation of the UTF-8 bytes of 💡, so when you dump the data it is dumped already in the incorrect encoding.

    To verify the issue you can try to convert the data with the query:

    SELECT
      CONVERT(BINARY(CONVERT(title USING latin1)) USING utf8mb4)
    FROM forums
    WHERE forumID = 8003;
    

    Run it in your latest MySQL 8 environment; it should display the emojis correctly. If so, dump the data again, this time using the character set it was originally encoded in (most likely latin1) by passing --default-character-set=latin1. The dump file should then contain emojis instead of ðŸ’¡-like text.

    Be aware that if the table has new content, that content will end up double encoded, or the dump will fail if the new text is not compatible with the latin1 encoding. It would be better to do this with the original data set, if you still have access to it.
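
As a rough illustration of the double encoding described above, and of the caution about tables that mix old and new content, here is a PHP sketch. The function names doubleEncode and repairIfDoubleEncoded are made up for this example. It assumes the mbstring extension is available and that MySQL's latin1 can be approximated by Windows-1252, which does not hold for the five bytes cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).

```php
<?php
// Simulate the corruption: the UTF-8 bytes are treated as Windows-1252
// characters (an approximation of MySQL's latin1) and each of those
// characters is then encoded as UTF-8 again.
function doubleEncode(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
}

// Undo one layer of double encoding, but only when the text round-trips.
// Text that does not look double encoded is returned unchanged.
function repairIfDoubleEncoded(string $text): string
{
    // Map each character back to the single cp1252 byte it came from.
    $bytes = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // The mapping is lossless only if re-applying the corruption reproduces
    // the input; characters outside cp1252 (a real emoji) fail this check.
    $lossless = doubleEncode($bytes) === $text;

    // The recovered bytes must themselves form valid UTF-8.
    if ($lossless && mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    return $text;
}
```

Here doubleEncode("💡") yields ðŸ’¡, and repairIfDoubleEncoded() turns that back into 💡 while leaving an already correct 💡 or a plain "café" untouched. A check like this is one possible way to avoid corrupting newer rows when only part of a table is double encoded; it is a heuristic, not a guarantee.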