Search code examples
phpmysqlutf-8utf8mb4mb-convert-encoding

utf8_encode and mb_convert_encoding Conflicting Resuts


I have a very annoying issue when transferring data from PHP (which is set up with default_charset = 'UTF-8') to MySQL, on an IIS environment with web.config set to:

<globalization requestEncoding="UTF-8" responseEncoding="UTF-8" />

Also tested on an Apache Environment with same default charset with same results. I also set MySQLi to set charset before any query:

$mysqli->set_charset('UTF-8');

I also have a set of prepared statements that search each character in a string to convert any none UTF-8 encoding to UTF-8 using:

mb_convert_encoding($char, 'UTF-8', mb_detect_encoding($char));

After trying and failing by parsing the whole string with the same process:

mb_convert_encoding($string, 'UTF-8', mb_detect_encoding($string));

The content type of the file with the script is set to content-type charset UTF-8 on the header. And still none of this has fixed the issue.

I also, have set all collations of the MySQL tables and schemas to utf8mb4, and of course using `utf8mb4_unicode_520_ci'. Still, issue persists.

The function I wrote to split the string is as follows:

    static private function split_and_convert($value, $encoding = 'UTF-8') {
        $split = mb_str_split($value);
        $filter = [];
        foreach($split as $chr) {
            $from = mb_detect_encoding($chr);
            $encoded = ($from !== $encoding)
                ? mb_convert_encoding($chr, $encoding, $from)
                : $chr;
            $filter[] = $encoded;
            // echoing for testing
            echo 'char ' . $encoded . ' (' . $from . ')<br>';
        }
        return join('', $filter);
    }

Which echo the following trying to save Martínez:

char M (ASCII)
char a (ASCII)
char r (ASCII)
char t (ASCII)
char � (UTF-8)
char n (ASCII)
char e (ASCII)
char z (ASCII)

With MySQL throwing the following error:

Error No: 1366 - Incorrect string value: '\xEDnez' for column 'contactLast' at row 1

Still no joy. But if I do this, it works like a charm:

    static private function split_and_convert($value, $encoding = 'UTF-8') {
        $split = mb_str_split($value);
        $filter = [];
        foreach($split as $chr) {
            $from = mb_detect_encoding($chr);
            $encoded = ($from !== $encoding)
                ? mb_convert_encoding($chr, $encoding, $from)
                // set to actually encode the 'UTF-8' encoded char
                // I'm supposed not to do this, but is what works
                // which makes no sense
                : utf8_encode($chr);
            $filter[] = $encoded;
            // echoing for testing
            echo 'char ' . $encoded . ' (' . $from . ')<br>';
        }
        return join('', $filter);
    }

With the following results:

char M (ASCII)
char a (ASCII)
char r (ASCII)
char t (ASCII)
char í (UTF-8)
char n (ASCII)
char e (ASCII)
char z (ASCII)

And MySQL not throwing any error. If I replace the utf8_encode with mb_convert_encoding($encoded, 'UTF-8'), I get the same errors as not filtering the string.

I'm suspecting mb_detect_encoding is either detecting erroneously the encoding of the characters or simply the mb_convert_encoding is not doing it's job.

If anyone can help I will greatly appreciate it. I'm been searching for hours for a solution and none of the solutions offered have worked. And I find myself in a pickle because what works has been deprecated.

So, these are the list of the issues I'm facing:

  1. mb_convert_encoding -> not encoding or encoding to something other than UTF8
  2. mb_detect_encoding -> not detecting correctly
  3. Strings is encoded as ASCII not UTF-8 as charset in header requested.
  4. Confusion: Why I need to encode to UTF-8 an UTF-8 encoded character to work?

Solution

  • After looking for answers and long day of research this was the issue. Over processing. At some point during development a middleware was put in place to verify each POST/GET request and process each value to ensure the encoding sent was UTF-8. The code I'm refactoring already took care of processing the data.

    The middleware was coded using utf8_encode, so, I'm suspecting the function check itself to avoid over processing. I have to refactor the middleman as well so there's no conflict.

    So, that was the issue, two different codes were processing the encoding creating the conflict. As soon I refactor the middleware the issue desisted.

    In conclusion, the problem is not mb_convert_encoding or mb_detect_encoding, if you have a similar issue, make sure the code you are working on is not using two different methods doing similar processes with different functions.

    In my case, the middleware was not verifying or set to detect if the string was already encoded to UTF-8, it was just implementing the encoding regardless. And the reason I've been paid to refactor the code.

    Still what does not make sense to me is why a properly UTF-8 encoded string caused MySQL to throw an error. For that I suspect the over processing was corrupting the input.