Tags: php, encoding, utf-16, utf

PHP UTF encoding problem


How can I encode strings in UTF-16BE format in PHP? For "Demo Message!!!", the encoded string should be '00440065006D006F0020004D00650073007300610067006'. I also need to encode Arabic characters in this format.


Solution

  • First of all, this is absolutely not UTF-8, which is just a character encoding (i.e. a way to store strings in memory or display them).

    What you have here looks like a dump of the bytes used to build each character.

    If so, you could get those bytes this way:

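    // utf8_encode() is deprecated as of PHP 8.2. It converts ISO-8859-1 to UTF-8,
    // which changes nothing for this pure-ASCII string, so the literal is used as-is.
    $str = "Demo Message!!!";
    
    // Dump each byte of the string as two hex digits.
    for ($i = 0; $i < strlen($str); $i++) {
        $byte = $str[$i];
        $char = ord($byte);
        printf('%02x ', $char);
    }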
    

    And you'd get the following output :

    44 65 6d 6f 20 4d 65 73 73 61 67 65 21 21 21 
    

    But, once again, what you posted is not UTF-8: in UTF-8, as you can see in the example I've given, `D` is stored in only one byte: `0x44`.

    In what you posted, it's stored using two bytes: 0x00 0x44.

    Maybe you're using some kind of UTF-16?
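
    To make the one-byte versus two-byte difference concrete, here is a minimal sketch that dumps the same character in both encodings. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8; the variable names are only illustrative.

    ```php
    <?php
    // Assumption: mbstring is available and this file is saved as UTF-8.
    $char = "D";

    // bin2hex() returns the raw bytes of a string as lowercase hex digits.
    $utf8Hex  = bin2hex($char);
    $utf16Hex = bin2hex(mb_convert_encoding($char, 'UTF-16BE', 'UTF-8'));

    echo "UTF-8:    $utf8Hex\n";   // 44
    echo "UTF-16BE: $utf16Hex\n";  // 0044
    ```

    The leading 00 byte in the second line is what distinguishes the dump in the question from a UTF-8 dump.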



    EDIT, after a bit more testing and @aSeptik's comment: this is indeed UTF-16.

    To get the kind of dump you posted, you have to make sure your string is encoded in UTF-16, which can be done with, for example, the mb_convert_encoding function:

    // 'UTF-16BE' makes the big-endian byte order requested in the question explicit.
    $str = mb_convert_encoding("Demo Message!!!", 'UTF-16BE', 'UTF-8');
    

    Then it's just a matter of iterating over the bytes that make up this string and dumping their values, like I did before:

    for ($i=0 ; $i<strlen($str) ; $i++) {
        $byte = $str[$i];
        $char = ord($byte);
        printf('%02x ', $char);
    }
    

    And you'll get the following output :

    00 44 00 65 00 6d 00 6f 00 20 00 4d 00 65 00 73 00 73 00 61 00 67 00 65 00 21 00 21 00 21 
    

    Which looks like what you posted :-)

    (To match it exactly, remove the space in the call to printf and use %02X instead of %02x to get uppercase hex digits. I left the space there to make the output easier to read.)
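
    Putting the pieces together, the conversion and the dump can be folded into one small function. This is only a sketch: the name utf16beHex is hypothetical (it is not part of PHP or of the code above), and it assumes the mbstring extension is loaded and that the input is valid UTF-8. It uses bin2hex() and strtoupper() in place of the printf loop.

    ```php
    <?php
    // Hypothetical helper: returns the UTF-16BE bytes of a UTF-8 string
    // as uppercase hex digits with no separators.
    function utf16beHex(string $utf8): string
    {
        // Assumption: mbstring is available and $utf8 is valid UTF-8.
        $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');

        // bin2hex() emits lowercase hex, so uppercase it to match the question.
        return strtoupper(bin2hex($utf16));
    }

    echo utf16beHex("Demo Message!!!"), "\n";
    // 00440065006D006F0020004D006500730073006100670065002100210021

    // Arabic text works the same way, provided this file is saved as UTF-8.
    echo utf16beHex("مرحبا"), "\n";
    // 06450631062D06280627
    ```

    Characters outside the Basic Multilingual Plane would come out as surrogate pairs, i.e. eight hex digits instead of four.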