Search code examples
phpmysqlutf-8utf8mb4flourishlib

How to support emojis with flourish?


I am using flourishlib for a website. My client requested that we should be able to use emojis with mobile phones. In theory we should change the character-encoding from utf8 to utf8mb4 for the MySQL database.

So far, so good, however, if we make this switch, like this:

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# (Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)

Then each character will use four bytes instead of three bytes. This would increase the database's size with 33%. This would result in worse performance and more storage space used up. So, as a result, we have decided to switch to an encoding of utf8mb4 for only specific columns of specific tables.

To make sure everything is all right, I have checked several things. Among them, I have checked flourishlib and found a few suspect parts:

  1. There is an fUTF8 class, which does not seem to support utf8mb4

  2. At fDatabase I am quoting some findings:

    if ($this->connection && function_exists('mysql_set_charset') && !mysql_set_charset('utf8', $this->connection)) {
        throw new fConnectivityException(
            'There was an error setting the database connection to use UTF-8'
        );
    }
    //...
    // Make MySQL act more strict and use UTF-8
    if ($this->type == 'mysql') {
        $this->execute("SET SQL_MODE = 'REAL_AS_FLOAT,PIPES_AS_CONCAT,ANSI_QUOTES,IGNORE_SPACE'");
        $this->execute("SET NAMES 'utf8'");
        $this->execute("SET CHARACTER SET utf8");
    }
    
  3. At fSQLSchemaTranslation I can see this:

    $sql = preg_replace('#\)\s*;?\s*$#D', ')ENGINE=InnoDB, CHARACTER SET utf8', $sql);
    

I have the suspicion that flourishlib will not support our quest of making a few columns of a few table have a character encoding of utf8mb4. I wonder whether we can upgrade something somehow to make this support. As a worst-case scenario, we can override every textual occurrence of utf8 to utf8mb4. However, that would be a very ugly hack and we wonder whether there is a better solution. Should we make this hack or is there a more orthodox approach?


Solution

  • I have resolved the issue. I have altered the tables where I wanted to support emojis by changing the column character set and collation, like this:

    ALTER TABLE table_name CHANGE column_name column_name text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    

    After that, I had to make a few ugly hacks to make flourishlib able to support emojis.

    fDatabase.php:

    line 685:

            if ($this->connection && function_exists('mysql_set_charset') && !mysql_set_charset('utf8mb4', $this->connection)) {
                throw new fConnectivityException(
                    'There was an error setting the database connection to use UTF-8'
                );
            }
    

    line 717 stays the same, everything crashes if this line is changed:

    if ($this->connection && function_exists('mysqli_set_charset') && !mysqli_set_charset($this->connection, 'utf8')) {
    

    line 800:

        // Make MySQL act more strict and use UTF-8
        if ($this->type == 'mysql') {
            $this->execute("SET SQL_MODE = 'REAL_AS_FLOAT,PIPES_AS_CONCAT,ANSI_QUOTES,IGNORE_SPACE'");
            $this->execute("SET NAMES 'utf8mb4'");
            $this->execute("SET CHARACTER SET utf8mb4");
        }
    

    fSQLSchemaTranslation.php:

    line 1554:

    $sql = preg_replace('#\)\s*;?\s*$#D', ')ENGINE=InnoDB, CHARACTER SET utf8mb4', $sql);
    

    fXML.php:

    line 403:

        if (preg_replace('#[^a-z0-9]#', '', strtolower($encoding)) == 'utf8mb4') {
            // Remove the UTF-8 BOM if present
            $xml = preg_replace("#^\xEF\xBB\xBF#", '', $xml);
            fCore::startErrorCapture(E_NOTICE);
            $cleaned = self::iconv('UTF-8', 'UTF-8', $xml);
            if ($cleaned != $xml) {
                $xml = self::iconv('Windows-1252', 'UTF-8', $xml);
            }
            fCore::stopErrorCapture();
        }
    

    and finally, when there are modifications for any of the columns affected, I execute this:

    App::db()->query("set names 'utf8mb4'");
    

    which, essentially triggers the ->query() execution of an fDatabase object.