Search code examples
node.jscharacter-encodingiso-8859-1iconv

iconv-lite not decoding everything properly, even though I'm using proper decoding


I'm using this piece of code to download a webpage (using request library) and decode everything (using iconv-lite library). The loader function is for finding some elements from the body of the website, then returning them as a JavaScript object.

request.get({url: url, encoding: null}, function(error, response, body) {
        // if webpage exists, process it, otherwise throw 'not found' error
        if (response.statusCode === 200) {
          body = iconv.decode(body, "iso-8859-1");
          const $ = cheerio.load(body);
          async function show() {
            var data = await loader.getDay($, date, html_tags, thumbs, res, image_thumbnail_size);
            res.send(JSON.stringify(data));
          }
          show();
        } else {
          res.status(404);
          res.send(JSON.stringify({"error":"No content for this date."}))
        }
      });

The pages are encoded in ISO-8859-1 format, and the content is looking normal, there are no bad chars. When I wasn't using iconv-lite, some characters, eg. ü, were looking like this: �. Now, when I'm using the library like in the code provided above, most of the chars are looking good, but some, eg. š are an empty box, even though they're displayed without any problems on the website.

I'm sure it's not cheerio's issue, because when I printed the output using res.send(body); or res.send(JSON.stringify({"body":body}));, the empty box character was still present there. Maybe it's a problem with Express? Is there a way to fix that?

EDIT: I copied the empty box character to Google, and it has changed to š, maybe that's important

Also, I tried to change output of Express using res.charset but that didn't help.


Solution

  • I used this website: https://validator.w3.org/nu/?doc=https%3A%2F%2Fapod.nasa.gov%2Fapod%2Fap170813.html to check if the page I'm scraping really has ISO-8859-1 encoding, it turned out that it has Windows-1252 encoding. I changed the encoding in my API (var encoding = 'windows-1252') and it works well now.