Decoding the JSON output from the Microsoft Translator API with PHP


This issue seems specific to microsofttranslator.com, so if you can, please test any answers against it.

I am using the following URL for translation: http://api.microsofttranslator.com/V2/Ajax.svc/TranslateArray. I send it some arguments via cURL and get back the following result:

 [
      {
           "From":"en",
           "OriginalTextSentenceLengths":[13],
           "TranslatedText":"我是最好的",
           "TranslatedTextSentenceLengths":[5]
      },
      {
           "From":"en",
           "OriginalTextSentenceLengths":[16],
           "TranslatedText":"你是最好的",
           "TranslatedTextSentenceLengths":[5]
      }
 ]

When I call json_decode($output, true) on the output from cURL, it fails with a syntax error in the returned JSON:

 json_last_error() == JSON_ERROR_SYNTAX

The headers being returned with the JSON:

 Cache-Control:no-cache
 Content-Length:244
 Content-Type:application/x-javascript; charset=utf-8
 Date:Sat, 06 Aug 2011 13:35:08 GMT
 Expires:-1
 Pragma:no-cache
 X-MS-Trans-Info:s=63644

Raw content:

 [{"From":"en","OriginalTextSentenceLengths":[13],"TranslatedText":"我是最好的","TranslatedTextSentenceLengths":[5]},{"From":"en","OriginalTextSentenceLengths":[16],"TranslatedText":"你是最好的","TranslatedTextSentenceLengths":[5]}]

cURL code:

    $texts = array("i am the best" => 0, "you are the best" => 0);
    $ch = curl_init(); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = array(
        'appId' => $bing_appId,
        'from' => 'en',
        'to' => 'zh-CHS',
        'texts' => json_encode(array_keys($texts))
    );
    curl_setopt($ch, CURLOPT_URL, $bingArrayUrl . '?' . http_build_query($data)); 
    $output = curl_exec($ch); 

Solution

  • The API is prepending a byte order mark (BOM) to the response, and json_decode rejects it.
    The string data itself is UTF-8, but it is preceded by U+FEFF, which in UTF-8 is encoded as the three bytes EF BB BF. Strip those first three bytes and then call json_decode.

    ...
    $output = curl_exec($ch);
    // Insert some sanity checks here... then,
    $output = substr($output, 3); // drop the 3-byte BOM (EF BB BF)
    ...
    $decoded = json_decode($output, true);
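
Stripping three bytes unconditionally would corrupt the JSON if the API ever stopped sending the BOM. A minimal sketch of a more defensive variant follows. The helper name strip_utf8_bom is hypothetical (it is not part of PHP or of the original answer), the $output string is a simulated response rather than a real API call, and the scalar type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original answer): removes a leading
// UTF-8 BOM (bytes EF BB BF) only when it is actually present.
function strip_utf8_bom(string $body): string
{
    if (strncmp($body, "\xEF\xBB\xBF", 3) === 0) {
        return substr($body, 3);
    }
    return $body;
}

// Simulated response body standing in for the result of curl_exec():
// a BOM followed by a shortened version of the JSON from the question.
$output = "\xEF\xBB\xBF" . '[{"From":"en","TranslatedTextSentenceLengths":[5]}]';

$decoded = json_decode(strip_utf8_bom($output), true);
if (json_last_error() !== JSON_ERROR_NONE) {
    // json_last_error_msg() is available from PHP 5.5 onwards.
    echo 'Decode failed: ' . json_last_error_msg() . "\n";
} else {
    echo $decoded[0]['From'] . "\n";
}
```

Because the helper returns the input unchanged when no BOM is found, it should be safe to leave in place whether or not the service sends one.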
    

    Here's the entirety of my test code.

    $texts = array("i am the best" => 0, "you are the best" => 0);
    $ch = curl_init(); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data = array(
        'appId' => $bing_appId,
        'from' => 'en',
        'to' => 'zh-CHS',
        'texts' => json_encode(array_keys($texts))
        );
    curl_setopt($ch, CURLOPT_URL, $bingArrayUrl . '?' . http_build_query($data)); 
    $output = curl_exec($ch);
    $output = substr($output, 3); // drop the 3-byte BOM (EF BB BF)
    print_r(json_decode($output, true));
    

    Which gives me

    Array
    (
        [0] => Array
            (
                [From] => en
                [OriginalTextSentenceLengths] => Array
                    (
                        [0] => 13
                    )
    
                [TranslatedText] => 我是最好的
                [TranslatedTextSentenceLengths] => Array
                    (
                        [0] => 5
                    )
    
            )
    
        [1] => Array
            (
                [From] => en
                [OriginalTextSentenceLengths] => Array
                    (
                        [0] => 16
                    )
    
                [TranslatedText] => 你是最好的
                [TranslatedTextSentenceLengths] => Array
                    (
                        [0] => 5
                    )
    
            )
    
    )
    

    Wikipedia entry on BOM
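
To check whether a BOM is really what breaks json_decode, it can help to look at the first bytes of the response in hex. The following is only a sketch: $output here is a simulated string standing in for the real cURL result, and the variable names $prefix and $hasBom are chosen for illustration.

```php
<?php
// Stand-in for the cURL response; the real $output comes from curl_exec().
$output = "\xEF\xBB\xBF" . '[{"From":"en"}]';

// Hex dump of the first three bytes. "efbbbf" is U+FEFF encoded as UTF-8.
$prefix = bin2hex(substr($output, 0, 3));
echo $prefix . "\n";

// Report whether the response starts with a UTF-8 BOM.
$hasBom = ($prefix === 'efbbbf');
echo $hasBom ? "BOM present\n" : "no BOM\n";
```

If the prefix is anything other than efbbbf, the syntax error probably has a different cause and stripping bytes would not help.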