javascript node.js unicode utf-8 v8

Node.js unicode issue with HTTP response body


The response body of HTTP requests made with the native 'http' module displays question marks in place of Unicode characters. Here's the basic snippet of code that I'm running:

var http = require('http');
var google = http.createClient(80, 'www.google.it');
var request = google.request('GET', '/', {
  'host': 'www.google.it'
});
request.end();
request.on('response', function (response) {
  response.setEncoding('utf8');
  response.on('data', function (chunk) {
    console.log(chunk);
  });
});

In the response there's a specific word that starts with "Pubblicit". Its last letter is rendered as a question mark for me: the word should be Pubblicità, but it is displayed as Pubblicit?.

I have also tried outputting the data using .toString():

console.log(chunk.toString());

or

console.log(chunk.toString('utf8'));

But I'm getting the same results.

Any idea?


Solution

  • The reason may be that, if the request does not carry a User-Agent header that Google recognizes as UTF-8 capable, Google responds with an HTML document whose content type is ISO-8859-1 (perhaps for old browsers or bots; I don't know). In that case, decoding the response buffer as "binary" is correct.

    If, however, a buffer encoded in ISO-8859-1 is decoded as UTF-8, the byte 0xE0 (à) signals the start of a 3-byte character. The bytes that follow do not complete a valid sequence, so the character is malformed and one or more unexpected characters (depending on the environment) are displayed instead.

    Try sending "Mozilla/5.0" as the User-Agent value. Good luck.
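
To make the explanation above concrete, here is a minimal sketch that decodes the same bytes both ways and then shows one way the suggested User-Agent header could be sent. It assumes a recent Node.js where Buffer.from, the 'latin1' encoding name, and http.get are available (the question's http.createClient API is no longer present in current releases). The helper name fetchWithUserAgent is hypothetical and not from the original post, and whether Google still varies the charset by User-Agent is the answer's assumption, not something this code verifies.

```javascript
var http = require('http');

// "Pubblicità" encoded as ISO-8859-1: the final "à" is the single byte 0xE0.
var latin1Bytes = Buffer.from([
  0x50, 0x75, 0x62, 0x62, 0x6c, 0x69, 0x63, 0x69, 0x74, 0xe0
]);

// Decoding with the matching charset recovers the word.
var asLatin1 = latin1Bytes.toString('latin1');

// Decoding as UTF-8 treats 0xE0 as the lead byte of a 3-byte sequence that
// never completes, so the last character comes out malformed.
var asUtf8 = latin1Bytes.toString('utf8');

console.log(asLatin1);
console.log(asUtf8);

// Hypothetical helper (an assumption, not part of the original answer):
// request the page with a browser-like User-Agent and hand back the declared
// content type together with the raw, still undecoded body.
function fetchWithUserAgent(callback) {
  var options = {
    host: 'www.google.it',
    path: '/',
    headers: { 'User-Agent': 'Mozilla/5.0' }
  };
  http.get(options, function (response) {
    var chunks = [];
    response.on('data', function (chunk) { chunks.push(chunk); });
    response.on('end', function () {
      callback(null, response.headers['content-type'], Buffer.concat(chunks));
    });
  }).on('error', callback);
}
```

Calling fetchWithUserAgent(function (err, contentType, body) { ... }) and checking contentType before choosing between body.toString('utf8') and body.toString('latin1') avoids guessing the charset. Collecting raw chunks and decoding once at the end also avoids splitting a multi-byte character across two 'data' events.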