Search code examples
phpmysqlpdoutf-8utf8mb4

Can I safely use a utf8mb4 connection with utf8 columns?


I have some MySQL tables with utf8mb4 fields, and others with utf8.

It is safe to use utf8mb4 in the PDO connection string for all tables? Or do I have to convert everything to utf8mb4, or start two different PDO connections?


EDIT: The question is not "can I store 4-byte characters into utf8 columns?" We already know we can't, that doesn't depend on the connection, so if a column is utf8 it means it will not receive 4 bytes characters, for example country or currency codes, email addresses, usernames... where the input is validated by the application.


Solution

  • This can be tested quite easily with the following script:

    <?php
    
    $pdo = new PDO('mysql:host=localhost;dbname=test', 'test', '');
    
    $pdo->exec("
        drop table if exists utf8_test;
        create table utf8_test(
            conn varchar(50) collate ascii_bin,
            column_latin1  varchar(50) collate latin1_general_ci,
            column_utf8    varchar(50) collate utf8_unicode_ci,
            column_utf8mb4 varchar(50) collate utf8mb4_unicode_ci
        );
    ");
    
    $latin = 'abc äŒé';
    $utf8  = '♔♕';
    $mb4   = '🛃 🔣';
    
    $pdo->exec("set names utf8");
    
    $pdo->exec("
        insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
         values ('utf8', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
    ");
    
    $pdo->exec("set names utf8mb4");
    
    $pdo->exec("
        insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
         values ('utf8mb4', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
    ");
    
    $result = $pdo->query('select * from utf8_test')->fetchAll(PDO::FETCH_ASSOC);
    
    var_export($result);
    

    And this is the result:

    array (
      0 => 
      array (
        'conn' => 'utf8',
        'column_latin1' => 'abc äŒé',
        'column_utf8' => 'abc äŒé ♔♕',
        'column_utf8mb4' => 'abc äŒé ♔♕ ???? ????',
      ),
      1 => 
      array (
        'conn' => 'utf8mb4',
        'column_latin1' => 'abc äŒé',
        'column_utf8' => 'abc äŒé ♔♕',
        'column_utf8mb4' => 'abc äŒé ♔♕ 🛃 🔣',
      ),
    )
    

    As you can see, we can not use utf8 as connection charset, when we work with utf8mb4 columns (see ????). But we can use utf8mb4 for connection when working with utf8 columns. Also neither has problems writing to and reading from latin or ascii columns.

    The reason is that you can encode any utf8, latin or ascii character in utf8mb4 but not the other way around. So using utf8mb4 as character set for connection is safe in this case.