This is a strange scenario, not a conventional conversion from one encoding to another.
Question
I use the Readability API to retrieve the main content from a given URL. It works fine if the target page is encoded in UTF-8, but when the target page is encoded in GB2312 (one of the Chinese encodings), I get garbage instead (the Chinese characters are wrongly encoded, but English letters and digits come through fine).
Deep Research
I inspected the HTTP headers the Readability API returns; they indicate that the API response is encoded in UTF-8.
Here's a snippet of wrongly encoded Chinese characters:
ÄÉ´ï¶û¾ø¾³ÏÂ´ó·´»÷¾Ü¾øÀäÃÅÄæ×ª½ú¼¶ÖÐÍøËÄÇ¿
Length: 42
The original text is:
纳达尔绝境下大反击拒绝冷门逆转晋级中网四强
Length: 21
However, if you convert the correct Chinese to Unicode, it should be:
纳达尔绝境下大反击拒绝冷门逆转晋级中网四强
Tried But Not Working
iconv("GB2312", "UTF-8", $str);
iconv("GBK", "UTF-8", $str);
iconv("GB18030", "UTF-8", $str);
mb_convert_encoding($str, "UTF-8", "GB2312");
mb_convert_encoding($str, "UTF-8", "GB18030");
mb_convert_encoding($str, "UTF-8", "GBK");
Solution Requested
Since the Readability API doesn't provide a parameter for the charset of the target URL (I call the API like https://www.readability.com/api/content/v1/parser?url=http://sports.sina.com.cn/t/2013-10-04/14596813815.shtml&token=my_token_here), I have to do the conversion when handling the API response.
I would appreciate any ideas about this issue.
Environment Info: PHP 5.3.6
It seems that the individual bytes that make up the characters have been encoded as HTML numeric entities, as if they were characters from ISO-8859-1 or some other 8-bit encoding. To undo the numeric entity encoding you can use mb_decode_numericentity:
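// Each GB2312 byte arrives as its own numeric entity (&#196; is byte 0xC4).
$str = "&#196;&#201;&#180;&#239;&#182;&#251;&#190;&#248;&#190;&#179;&#207;&#194;&#180;&#243;&#183;&#180;&#187;&#247;&#190;&#220;&#190;&#248;&#192;&#228;&#195;&#197;&#196;&#230;&#215;&#170;&#189;&#250;&#188;&#182;&#214;&#208;&#205;&#248;&#203;&#196;&#199;&#191;";
// Turn entities 0-255 back into raw single bytes.
$str = mb_decode_numericentity($str, array(0, 255, 0, 255), "ISO-8859-1");
// Reinterpret those bytes as GB2312 and convert them to UTF-8.
echo iconv("GB2312", "UTF-8", $str);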
This produces the expected output of 纳达尔绝境下大反击拒绝冷门逆转晋级中网四强.
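If the same fix is needed in more than one place, the two steps can be wrapped in a small helper. The following is only a sketch: the function name decode_byte_entities and its $sourceCharset parameter are made up for illustration and are not part of PHP or of the Readability API. It assumes the text contains only ASCII plus numeric entities in the 128-255 range, each standing for one raw byte of the source charset. A response that mixes such entities with real UTF-8 characters would need more careful handling.

```php
<?php
// Hypothetical helper, not part of PHP or the Readability API.
// Assumption: every numeric entity from &#128; to &#255; in $text is one
// raw byte of text that was originally encoded in $sourceCharset.
function decode_byte_entities($text, $sourceCharset = 'GB2312')
{
    // Only decode entities in the 0x80-0xFF range, so ASCII entities such
    // as &#60; (an escaped "<") stay escaped.
    $convmap = array(0x80, 0xFF, 0, 0xFF);

    // Treating the string as ISO-8859-1 makes each decoded entity come out
    // as exactly one byte.
    $bytes = mb_decode_numericentity($text, $convmap, 'ISO-8859-1');

    // Reinterpret the byte string in its real charset and convert to UTF-8.
    return iconv($sourceCharset, 'UTF-8', $bytes);
}

// The first two characters of the sample: bytes C4 C9 B4 EF in GB2312.
echo decode_byte_entities('&#196;&#201;&#180;&#239;'), "\n"; // prints 纳达
```

Restricting the conversion map to 128-255 is a design choice here; the original snippet decodes 0-255, which also unescapes ASCII entities.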