Tags: php, encoding, utf-8, character-encoding, mbstring

Converting Unicode reference to UTF-8 character in PHP with mbstring


I have a set of data in a database that was entered with Unicode characters, but the escape sequences were stored as literal strings. That is, where there should be an apostrophe, I've actually got \u2019.

So I now need to convert this into its character representation, which is ’. First, it is quite easy to change the string into its entity version, &#x2019;; then I need to turn that into the correct UTF-8 multibyte string.

I have attempted to do this in a number of ways. On my local server I can extract the escape sequences with preg_match and then pass each one to the following function:

mb_convert_encoding($string, "UTF-8", "HTML-ENTITIES");

Sounds quite sensible, and works without issue. Turning off the UTF-8 charset in the browser shows that this has actually been converted into the three bytes that display as â€™ when read with the browser's default encoding.
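For reference, the extract-and-convert step can also be sketched in a single pass that skips the entity form entirely. This is only a minimal sketch, not the code from my project: the function name unescapeUnicode() is made up for illustration, and it assumes the stored text contains literal \uXXXX escapes holding UTF-16 code units. It goes through UTF-16BE rather than HTML-ENTITIES, partly because, as far as I know, the HTML-ENTITIES pseudo-encoding is deprecated in mbstring from PHP 8.2.

```php
<?php
// Hypothetical helper (name is illustrative): replace runs of literal
// \uXXXX escapes with the UTF-8 characters they stand for.
function unescapeUnicode(string $text): string
{
    return preg_replace_callback(
        // One or more consecutive \uXXXX escapes, so that surrogate
        // pairs are converted together as one UTF-16 sequence.
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            // Pull out the four hex digits of every escape in the run.
            preg_match_all('/\\\\u([0-9a-fA-F]{4})/', $m[0], $units);
            $utf16 = '';
            foreach ($units[1] as $hex) {
                // Pack each code unit as a big-endian 16-bit value.
                $utf16 .= pack('n', hexdec($hex));
            }
            // Let mbstring transcode the UTF-16BE bytes to UTF-8.
            return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16BE');
        },
        $text
    );
}

// Single quotes keep the backslash literal, as it is in the database.
$fixed = unescapeUnicode('It\u2019s done');
echo bin2hex($fixed), "\n";
```

The bin2hex() call is only there to show the raw bytes, which makes it easy to compare the result between two servers regardless of how a browser renders it.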

However, the exact same code, when run in my production environment, produces the dreaded "missing symbol" box when rendered as UTF-8. With UTF-8 turned off, it has produced whatever byte stream renders as ò°‘£. It appears to be outputting 4 bytes rather than 3; I don't know if that is relevant, as I'm not well read on character encoding.

I assume that the issue is with my mbstring settings. Here are the mbstring settings from my local server:

Multibyte Support                      enabled
Multibyte string engine                libmbfl
HTTP input encoding translation        disabled
Multibyte (japanese) regex support     enabled
Multibyte regex (oniguruma) version    4.7.1

Directive                              Local Value                        Master Value
mbstring.detect_order                  no value                           no value
mbstring.encoding_translation          Off                                Off
mbstring.func_overload                 0                                  0
mbstring.http_input                    auto                               auto
mbstring.http_output                   UTF-8                              UTF-8
mbstring.http_output_conv_mimetypes    ^(text/|application/xhtml\+xml)    ^(text/|application/xhtml\+xml)
mbstring.internal_encoding             UTF-8                              UTF-8
mbstring.language                      neutral                            neutral
mbstring.strict_detection              Off                                Off
mbstring.substitute_character          no value                           no value

There are a few differences on my production environment:

Multibyte Support                      enabled
Multibyte string engine                libmbfl
Multibyte (japanese) regex support     enabled
Multibyte regex (oniguruma) version    3.7.1

Directive                              Local Value    Master Value
mbstring.detect_order                  no value       no value
mbstring.encoding_translation          Off            Off
mbstring.func_overload                 0              0
mbstring.http_input                    auto           auto
mbstring.http_output                   UTF-8          UTF-8
mbstring.internal_encoding             UTF-8          UTF-8
mbstring.language                      neutral        neutral
mbstring.strict_detection              Off            Off
mbstring.substitute_character          no value       no value
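Rather than comparing phpinfo() output by eye, a small script along these lines could be run on both servers and the two outputs diffed. It is only a sketch: the $report array and its keys are my own naming, and it assumes the mbstring extension is loaded on both machines.

```php
<?php
// Sketch: collect the values worth diffing between two servers.
$report = [
    'php_version'     => PHP_VERSION,                 // the PHP builds may differ
    'default_charset' => ini_get('default_charset'),  // php.ini default charset
    'mbstring'        => mb_get_info(),               // current mbstring settings
];

// Sort the mbstring settings by name so both outputs line up in a diff.
ksort($report['mbstring']);

echo json_encode($report, JSON_PRETTY_PRINT), "\n";
```

The PHP version is included because the listings above already show different oniguruma versions, which suggests the two builds are not the same.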

Anyone see what I'm doing wrong?


Solution

  • See if this can help you: hex2ascii and ascii2hex

    ADDED on 09-19-2012:

    // Render each byte of $ascii as two uppercase hex digits followed by a space.
    function ascii2hex($ascii)
    {
        $hex = '';
        for ($i = 0; $i < strlen($ascii); $i++)
        {
            // $ascii[$i] replaces the $ascii{$i} offset syntax, which was removed in PHP 8.0.
            $hex .= sprintf('%02X', ord($ascii[$i]))." ";
        }
        return $hex;
    }

    // Inverse of ascii2hex(): turn pairs of hex digits (spaces ignored) back into bytes.
    function hex2ascii($hex)
    {
        $ascii = '';
        $hex = str_replace(" ", "", $hex);
        for ($i = 0; $i < strlen($hex); $i += 2)
        {
            $ascii .= chr(hexdec(substr($hex, $i, 2)));
        }

        return $ascii;
    }
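As a usage sketch, dumping the converted string as hex on both servers shows exactly which bytes each one produces, independent of how the browser decodes them. The example below is self-contained and uses the built-in bin2hex() and hex2bin() instead of the two functions above; hexDump() is a hypothetical helper name chosen for illustration. The correct UTF-8 encoding of U+2019 is the three bytes E2 80 99, so anything else in the production output points at the conversion step rather than at the browser.

```php
<?php
// Hypothetical helper: format a string's bytes as spaced, uppercase hex.
function hexDump(string $bytes): string
{
    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

// The UTF-8 encoding of U+2019 (right single quotation mark).
$expected = "\xE2\x80\x99";

echo hexDump($expected), "\n";                    // the three bytes as hex pairs
echo strlen($expected), "\n";                     // byte count, not character count
var_dump(mb_check_encoding($expected, 'UTF-8'));  // is it well-formed UTF-8?
var_dump(hex2bin('e28099') === $expected);        // hex back to the same bytes
```

In practice, $expected would be replaced by the string your conversion code returns on each server.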