NodeJS not detecting or removing question-marks (?) in scraped HTML


I'm developing a NodeJS program that uses the request and cheerio packages to do some scraping for a research project. Part of the scraped data is news article titles. When scraping some of these titles, extended special characters (like —, a big dash) are being read as ?—? in the webpage. This is how I'm fetching the pages and loading them into cheerio. The question marks exist both in the raw HTML response and in the cheerio object.

function aRequest(url){
    return new Promise((res, rej)=>{
        request({
            url: url,
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'
            }
        }, (err, resp, html)=>{
            if(!err){
                res(cheerio.load(html));
            } else {
                rej(err);
            }
        });
    });
}

These question marks surrounding the special character do not exist in the original title, so I'm attempting to remove them (and in the process I end up removing the big dash too, although that isn't really a problem). A lot of the solutions I've tried don't seem to work. Here are some of the methods I've tried, including answers listed in the following SO questions:

Remove ASCII question mark

Remove all special characters with regexp

The answer from the special character removal question works to remove the dash, but the question marks still remain. Some code snippets of things I've tried that do not work:

.replace("?—?", " — ");
.replace(/[^\w\s]/gi, " — ");
.replace("?", "");
.replace(/[?]/gi, " ");
.replace("�", ""); // ASCII question mark
// this is the point I started getting desperate to just have it work
.replace(/[^\w\s]/gi, "").replace("??", " — ");

I figure I could probably get the index of where the dash occurs and remove the characters one index to the left and right of it, although that seems like a last-resort kind of thing.

Furthermore, removing even regular question marks from the strings doesn't seem to work. For example, if I have a title of "This is a title?", none of these replace operations on question marks (like just .replace(/[?]/gi, "");) removes that question mark either.

Am I missing something here? I have a feeling the question mark is some kind of non-English character instead of an actual question mark, although I'm not sure what it would be. How can I remove the ?—? and just replace it with " — "?

My Node version is v10.15.0, and I'm using the latest versions of cheerio and request available from npm.

EDIT: I've since found this question, which experienced a similar problem. I tried removing the characters by character code 57399 (which is what that person experienced), but it still did not remove them. Will attempt to identify the char code of the question marks.
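
One way to identify the characters is to print the code point of everything outside the ASCII range. The snippet below is only a minimal sketch and is not part of the original program: describeChars and sampleTitle are hypothetical names, and the sample title is a made-up input written with escapes so that any invisible characters are explicit.

```javascript
// Hypothetical helper: list each character of a string with its code point.
function describeChars(text) {
    return Array.from(text, (ch) => {
        const code = ch.codePointAt(0);
        return {
            char: ch,
            code: code,
            hex: "U+" + code.toString(16).toUpperCase().padStart(4, "0"),
        };
    });
}

// Assumed example input: a title with non-ASCII characters around a dash.
const sampleTitle = "Breaking\u200A\u2014\u200Anews";

// Keep only the characters outside the ASCII range (code > 127).
const unusual = describeChars(sampleTitle).filter((c) => c.code > 127);

// Logs three entries with codes 8202, 8212 and 8202.
console.log(unusual);
```

Running this on a real scraped title should show which character codes the console is rendering as "?".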


Solution

  • For some reason, the "question marks" were not real question marks at all. Their character code was actually 8202, which is why replacing a standard question mark (?) was not working. The following replace snippet replaced them the way I wanted:

    const abq = String.fromCharCode(8202);
    .replace(abq+"—"+abq, " — ");
    

    I also wanted to replace any other occurrences of these abnormal characters with regular question marks, so I also did:

    .replace(new RegExp(abq, "g"), "?");
    

    EDIT: after looking up the character, it turns out to be a hair space (U+200A), not a question mark. So instead of replacing these characters with normal question marks, I just replace them with a normal space.

    .replace(new RegExp(abq, "g"), " ");
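
Putting the pieces of this answer together, a small self-contained helper might look like the sketch below. The names cleanTitle and HAIR_SPACE_PATTERN are hypothetical and only for illustration. The sketch assumes the only problematic character is the hair space (character code 8202), and it uses the global flag so that every occurrence in a title is replaced, not just the first.

```javascript
// U+200A (character code 8202) is the hair space identified above.
// The "g" flag makes replace() act on every occurrence in the string.
const HAIR_SPACE_PATTERN = /\u200A/g;

// Hypothetical helper: swap every hair space for a normal space.
function cleanTitle(title) {
    return title.replace(HAIR_SPACE_PATTERN, " ");
}

// Assumed example input, written with escapes to make the hair spaces visible.
const scraped = "Breaking\u200A\u2014\u200Anews";
const cleaned = cleanTitle(scraped);

console.log(cleaned === "Breaking \u2014 news"); // true
console.log(cleanTitle("This is a title?"));     // real question marks are untouched
```

Because only the hair space is targeted, genuine question marks and the dash itself are left as they are.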