Search code examples
phpmysqllaravellumenilluminate-container

Lumen MySQL query not handling UTF8 value as expected


I'm working against a database that is using UTF8 encoding and has many user names that contain special characters, such as "Ғђ ▫ Sony"

When querying the user table, Lumen responds with incorrect data. I've tried querying the same table using mysqli and PDO and I receive the expected results. I set up a sample route to test it:

$app->get("charset", function() {
    $mysqli = new mysqli("localhost", "user", "password", "database");
    $res = $mysqli->query("select name from users where id = 1");

    $dbh = new PDO('mysql:host=localhost;dbname=database', "user", "password");
    $stmt = $dbh->query("select name from users where id = 1");

    $lumen = DB::select("select name from users where id = 1");

    return response()->json([
        "mysqli" => $res->fetch_assoc(),
        "pdo" => $stmt->fetchAll(PDO::FETCH_ASSOC),
        "framework" => $lumen
    ]);
});

When accessing the route, I get the following response:

{
  "mysqli": {
    "name": "Ғђ ▫ Sony"
  },
  "pdo": [
    {
      "name": "Ғђ ▫ Sony"
    }
  ],
  "framework": [
    {
      "name": "Ò’Ñ’ â–« Sony"
    }
  ]
}

Here's a screenshot of the response in case the text above does not display correctly: broken UTF8 response

As far as I can tell, Lumen's MySQL config defaults to UTF8 and is unchangeable - I found the following in vendor/laravel/lumen-framework/config/database:

'mysql' => [
    'driver'    => 'mysql',
    'host'      => env('DB_HOST', 'localhost'),
    'database'  => env('DB_DATABASE', 'forge'),
    'username'  => env('DB_USERNAME', 'forge'),
    'password'  => env('DB_PASSWORD', ''),
    'charset'   => 'utf8',
    'collation' => 'utf8_unicode_ci',
    'prefix'    => env('DB_PREFIX', ''),
    'timezone'  => env('DB_TIMEZONE','+00:00'),
    'strict'    => false,
],

I'm at a loss as to what could be causing this. What else can I do to try to track down this discrepancy?


Solution

  • This answer is based on my previous comments above.

    The MySQL connection charset defines what encoding is used for communication between the MySQL client (PHP) and server. It does not matter what encoding is used as the internal encoding in the actual MySQL tables. The MySQL server will automatically convert the data between the table encoding and connection encoding. So the connection encoding basically defines the format in which you expect to get the data from MySQL and also in which format you are inserting data to MySQL.

    Are you sure the data is correctly encoded in utf8 in the database?

    Seems like you are only using UTF8 for the lumen DB connection (if that is the default), but you do not use UTF8 with your mysqli or PDO connection examples. Do you get the same result if you set the mysqli charset using $mysqli->set_charset("utf8"); and PDO charset using new PDO('mysql:host=localhost;dbname=database;charset=utf8', "user", "password");?

    Based on your code and output example it would seem that you are correctly getting data in UTF8 from lumen DB connection but the output is not displayed as UTF8.

    This also explains why the mysqli and PDO output is shown correctly because they are not returning the data in UTF8 (because you have not set their connection charset to UTF8) but by default they seem to match whatever encoding it is you're displaying the output in (apparently "latin1" or compatible).

    If you are viewing the output in a web browser make sure the output page charset is defined correctly (e.g. using a header).

    Edit:

    Btw it should not matter what connection encoding is used in the other system that inserts the data as long as the connection encoding matches encoding of the data that is sent through the connection.

    Setting connection encoding to latin1 means you will get string data as latin1 when you do a SELECT. So it would seem your output is handled as latin1 instead of UTF-8. It would probably be better if you fixed your script output to correctly display "as UTF-8" instead if your output environment (e.g. web browser) supports it. Because otherwise you will have problems if you need to handle characters that can not be shown in latin1. Though if you output to CLI terminal/console instead then of course you should use the same encoding as your default terminal encoding (which can be UTF-8 or something else). I prefer to have my Linux terminals also configured as UTF-8.