Tags: javascript, web-scraping, url, href, cheerio

href values I get start with "/url?q=" using Cheerio


I am trying to use Cheerio for scraping, but I ran into a small problem: all the href values I get on the client side start with "/url?q=". For example:

'/url?q=https://www.nimh.nih.gov/health/topics/auti… pkCZkQFnoECAYQAg&usg=AOvVaw1E4L1bLVm9OdBSFMkjJftQ'

The element on the Google search results page is:

<a jsname="UWckNb" href="https://www.nimh.nih.gov/health/topics/autism-spectrum-disorders-asd"...

It doesn't contain "/url?q=". Where does "/url?q=" come from?

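const express = require('express');
const request = require('request');
const cheerio = require('cheerio');

const app = express();

app.get('/scrape', (req, res) => {
    request('https://www.google.com/search?q=asd', (error, response, html) => {
        // On a network error, response is undefined, so check error first.
        if (error) {
            console.error('Request failed:', error);
            return res.status(500).send('Request failed.');
        }

        if (response.statusCode === 200) {
            const $ = cheerio.load(html);
            const results = [];
            const links = $('a');
            links.each((index, link) => {
                const href = $(link).prop('href');
                const h3 = $(link).find('h3');

                // Keep only the links that wrap a result title.
                if (h3.length > 0) {
                    const textContent = h3.text().trim();
                    results.push({ href, textContent });
                }
            });

            const responseData = {
                links: results,
                total: results.length
            };

            res.json(responseData);
        } else {
            console.error('Unexpected status code:', response.statusCode);
            res.status(500).send('Unexpected status code.');
        }
    });
});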

I know that I can solve it like this:

 const actualUrl = decodeURIComponent(href.split('/url?q=')[1].split('&')[0]);

But I would like to know where this "/url?q=" comes from. What am I doing wrong?


Solution

  • That's just how the URLs appear in the static HTML sent by the server. Apparently some JS runs after load and trims the hrefs, but since Cheerio doesn't run JS, there's not much you can do about that, aside from switching to a browser automation library like Puppeteer. Be wary of what you see in devtools: it shows the DOM after scripts have run, not the raw HTML that Cheerio receives.

    I'd use caution with decodeURIComponent(href.split('/url?q=')[1].split('&')[0]); because the target URL may have its own query parameters after an & that matter for the page, and cutting at the first & would drop them. I'm not sure what the pattern is, but &sa=U seems to be where Google's own suffix begins.
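As a rough illustration of a less brittle alternative to the split-based one-liner, the sketch below parses the href with the standard URL class and reads the q parameter. The helper name unwrapGoogleHref and the base URL are assumptions made for this example, not something from the question. It only recovers the full target when Google percent-encodes the target's own & characters; if they arrive unencoded, anything after the first & is still lost.

```javascript
// Hypothetical helper: unwraps a Google redirect href of the form
// "/url?q=<target>&sa=U&...". Any other href is returned unchanged.
function unwrapGoogleHref(href) {
  // The base is an assumption; it is only used to resolve relative hrefs.
  const url = new URL(href, 'https://www.google.com');
  if (url.hostname !== 'www.google.com' || url.pathname !== '/url') {
    return href;
  }
  // searchParams.get() already percent-decodes the value,
  // so no extra decodeURIComponent call is needed.
  return url.searchParams.get('q') ?? href;
}

console.log(unwrapGoogleHref('/url?q=https://example.com/page%3Fa%3D1%26b%3D2&sa=U'));
// prints "https://example.com/page?a=1&b=2"
```

Inside the links.each callback this would be used as unwrapGoogleHref($(link).prop('href')).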

    Also:

    • Your selector can be simplified to search only for a > h3, or use a class, which seems pretty stable (this answer from last year which uses a class is still valid and more direct).
    • There's no need to send the .length to the client. Arrays have built-in length so they can just access that naturally. A cached length is poor practice because it's unnecessary and can easily go out of sync with the actual length.
    • Avoid using request. It's deprecated and callbacks are out of vogue. Prefer promises--fetch is now standard in Node 18+, and there's axios as well, which is also preferred over request.
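To make the last two points concrete, here is a minimal sketch of a promise-based replacement for the request call, assuming Node 18+ where fetch is global. The names fetchHtml, fetchImpl and scrapeLinks are hypothetical; fetchImpl exists only so that a stub can be passed in for testing.

```javascript
// Sketch only: fetchHtml and fetchImpl are illustrative names.
// With no second argument it uses the global fetch (Node 18+).
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  // fetch resolves on HTTP error statuses, so check response.ok explicitly.
  if (!response.ok) {
    throw new Error(`Unexpected status code: ${response.status}`);
  }
  return response.text();
}

// Possible use in the Express route (app and scrapeLinks are assumed to exist;
// scrapeLinks would hold the Cheerio logic and return the results array):
//
// app.get('/scrape', async (req, res) => {
//   try {
//     const html = await fetchHtml('https://www.google.com/search?q=asd');
//     res.json(scrapeLinks(html)); // send the array; the client can read .length
//   } catch (err) {
//     console.error(err);
//     res.status(500).send('Scrape failed.');
//   }
// });
```

Returning the array directly means the client reads results.length itself, so there is no separate total field to keep in sync.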