I am trying to use Cheerio for scraping, but I have a small problem: all the href values I get on the client side start with "/url?q=". For example:
'/url?q=https://www.nimh.nih.gov/health/topics/auti… pkCZkQFnoECAYQAg&usg=AOvVaw1E4L1bLVm9OdBSFMkjJftQ'
The element on the Google search results page is:
<a jsname="UWckNb" href="https://www.nimh.nih.gov/health/topics/autism-spectrum-disorders-asd"...
It doesn't contain "/url?q=". Where does "/url?q=" come from?
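const express = require('express');
const request = require('request');
const cheerio = require('cheerio');

const app = express();

app.get('/scrape', (req, res) => {
  request('https://www.google.com/search?q=asd', (error, response, html) => {
    // response is undefined when the request itself fails, so check error first
    if (error) {
      console.error('Request failed:', error);
      return res.status(500).send('Request failed.');
    }
    if (response.statusCode === 200) {
      const $ = cheerio.load(html);
      const results = [];
      const links = $('a');
      // Keep only the links that wrap an h3 (the result titles)
      links.each((index, link) => {
        const href = $(link).prop('href');
        const h3 = $(link).find('h3');
        if (h3.length > 0) {
          const textContent = h3.text().trim();
          results.push({ href, textContent });
        }
      });
      const responseData = {
        links: results,
        total: results.length
      };
      res.json(responseData);
    } else {
      console.error('Unexpected status code:', response.statusCode);
      res.status(500).send('Unexpected status code.');
    }
  });
});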
I know that I can solve it like this:
const actualUrl = decodeURIComponent(href.split('/url?q=')[1].split('&')[0]);
But I would like to know where this "/url?q=" comes from. What am I doing wrong?
That's just how the URLs are in the static HTML sent from the server. Apparently some JS runs after load and trims the hrefs, but since Cheerio doesn't run JS, there's not much you can do about that, aside from switching to a browser automation library like Puppeteer. Be wary of what you see in devtools: the elements panel shows the DOM after scripts have modified it, not the HTML the server originally sent.
I'd use caution with decodeURIComponent(href.split('/url?q=')[1].split('&')[0]); because there may be query parameters after the & that matter for the page. I'm not sure what the pattern is, but &sa=U seems to be where the suffix added by Google begins.
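One way to avoid the string splitting is to let the built-in URL parser do the work. This is only a sketch: extractTargetUrl is a hypothetical helper name, and it assumes the destination is percent-encoded inside the q parameter, which I haven't verified for every kind of result. If Google ever leaves a raw & in the destination, this approach would truncate it just like the split version.

```javascript
// Hypothetical helper: pull the destination out of a Google redirect href.
// Assumption: the real URL is percent-encoded in the "q" query parameter.
function extractTargetUrl(href, base = 'https://www.google.com') {
  let parsed;
  try {
    // Relative hrefs such as "/url?q=..." are resolved against the base
    parsed = new URL(href, base);
  } catch {
    return href; // not parseable, so leave it untouched
  }

  // Only unwrap redirect links on the search origin itself
  const isRedirect =
    parsed.origin === new URL(base).origin && parsed.pathname === '/url';
  if (!isRedirect) return href;

  // searchParams.get() already percent-decodes the value
  return parsed.searchParams.get('q') ?? href;
}
```

Hrefs that aren't redirect links are returned unchanged, so it should be safe to run over every link you collect.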
Also:
- Rather than selecting every a and then checking whether it contains an h3, select a > h3 directly, or use a class, which seems pretty stable (this answer from last year which uses a class is still valid and more direct).
- There's no need to send .length to the client. Arrays have a built-in length, so the client can just access that naturally. A cached length is poor practice because it's unnecessary and can easily go out of sync with the actual length.
- Avoid request. It's deprecated and callbacks are out of vogue. Prefer promises: fetch is now standard in Node 18+, and there's axios as well, which is also preferred over request.