I have some MySQL tables with utf8mb4 fields, and others with utf8.
It is safe to use utf8mb4 in the PDO connection string for all tables? Or do I have to convert everything to utf8mb4, or start two different PDO connections?
EDIT: The question is not "can I store 4-byte characters into utf8 columns?" We already know we can't, that doesn't depend on the connection, so if a column is utf8 it means it will not receive 4 bytes characters, for example country or currency codes, email addresses, usernames... where the input is validated by the application.
This can be tested quite easily with the following script:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test', 'test', '');
$pdo->exec("
drop table if exists utf8_test;
create table utf8_test(
conn varchar(50) collate ascii_bin,
column_latin1 varchar(50) collate latin1_general_ci,
column_utf8 varchar(50) collate utf8_unicode_ci,
column_utf8mb4 varchar(50) collate utf8mb4_unicode_ci
);
");
$latin = 'abc äŒé';
$utf8 = '♔♕';
$mb4 = '🛃 🔣';
$pdo->exec("set names utf8");
$pdo->exec("
insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
values ('utf8', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
");
$pdo->exec("set names utf8mb4");
$pdo->exec("
insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
values ('utf8mb4', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
");
$result = $pdo->query('select * from utf8_test')->fetchAll(PDO::FETCH_ASSOC);
var_export($result);
And this is the result:
array (
0 =>
array (
'conn' => 'utf8',
'column_latin1' => 'abc äŒé',
'column_utf8' => 'abc äŒé ♔♕',
'column_utf8mb4' => 'abc äŒé ♔♕ ???? ????',
),
1 =>
array (
'conn' => 'utf8mb4',
'column_latin1' => 'abc äŒé',
'column_utf8' => 'abc äŒé ♔♕',
'column_utf8mb4' => 'abc äŒé ♔♕ 🛃 🔣',
),
)
As you can see, we can not use utf8
as connection charset, when we work with utf8mb4
columns (see ????
). But we can use utf8mb4
for connection when working with utf8
columns. Also neither has problems writing to and reading from latin
or ascii
columns.
The reason is that you can encode any utf8
, latin
or ascii
character in utf8mb4
but not the other way around. So using utf8mb4
as character set for connection is safe in this case.