I'm using this piece of code to download a webpage (using the request library) and decode it (using the iconv-lite library). The loader function finds some elements in the body of the page and returns them as a JavaScript object.
request.get({url: url, encoding: null}, function(error, response, body) {
    // encoding: null keeps body as a raw Buffer so it can be decoded manually.
    // If the request succeeded and the page exists, process it;
    // otherwise respond with a 'not found' error.
    if (!error && response.statusCode === 200) {
        body = iconv.decode(body, "iso-8859-1");
        const $ = cheerio.load(body);
        async function show() {
            var data = await loader.getDay($, date, html_tags, thumbs, res, image_thumbnail_size);
            res.send(JSON.stringify(data));
        }
        show();
    } else {
        res.status(404);
        res.send(JSON.stringify({"error":"No content for this date."}));
    }
});
The pages are encoded in ISO-8859-1, and on the website the content looks normal, with no bad characters. When I wasn't using iconv-lite, some characters, e.g. ü, looked like this: �. Now that I'm using the library as in the code above, most characters look fine, but some, e.g. š, show up as an empty box, even though they're displayed without any problems on the website.
I'm sure it's not cheerio's issue, because when I printed the output using res.send(body); or res.send(JSON.stringify({"body":body}));, the empty box character was still there. Maybe it's a problem with Express? Is there a way to fix that?
EDIT:
I pasted the empty box character into Google, and it changed to š; maybe that's important.
Also, I tried to change the output of Express using res.charset, but that didn't help.
I used this website: https://validator.w3.org/nu/?doc=https%3A%2F%2Fapod.nasa.gov%2Fapod%2Fap170813.html to check whether the page I'm scraping is really encoded as ISO-8859-1; it turned out that it is encoded as Windows-1252. I changed the encoding in my API (var encoding = 'windows-1252') and it works well now.
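For anyone hitting the same thing, here is a minimal sketch of why the box showed up. The two encodings agree on ü (byte 0xFC) but differ in the 0x80-0x9F range: ISO-8859-1 turns byte 0x9A into the invisible control character U+009A, while Windows-1252 maps it to š (U+0161). The sample bytes, the partial CP1252_OVERRIDES table and the decodeWindows1252 helper are made up for illustration and use only Node built-ins; in the real code, iconv.decode(body, 'windows-1252') does the full job.

```javascript
// Bytes for "ü š" as a Windows-1252 page would send them (sample data).
const bytes = Buffer.from([0xfc, 0x20, 0x9a]);

// Node's 'latin1' is ISO-8859-1: every byte becomes the code point with
// the same value, so 0x9A turns into the control character U+009A.
const asLatin1 = bytes.toString('latin1');

// Windows-1252 differs from ISO-8859-1 only in the 0x80-0x9F range.
// Partial table for illustration only; a real decoder covers the whole range.
const CP1252_OVERRIDES = {
  0x80: '\u20ac', // €
  0x8a: '\u0160', // Š
  0x9a: '\u0161', // š
  0x9e: '\u017e', // ž
};

// Hypothetical helper: decode a Buffer using the partial table above.
function decodeWindows1252(buf) {
  let out = '';
  for (const byte of buf) {
    out += CP1252_OVERRIDES[byte] ?? String.fromCharCode(byte);
  }
  return out;
}

const asCp1252 = decodeWindows1252(bytes);

console.log(asLatin1.codePointAt(2).toString(16)); // 9a  (control char, drawn as a box)
console.log(asCp1252.codePointAt(2).toString(16)); // 161 (š)
```

This probably also explains why the page looks fine in a browser: as far as I know, browsers follow the WHATWG Encoding Standard, which treats the iso-8859-1 label as Windows-1252.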