Tags: javascript, node.js, axios, web-crawler, cheerio

Web scraping with cheerio not working with some elements


I just started learning about web scraping and I found this tutorial: https://www.mundojs.com.br/2020/05/25/criando-um-web-scraper-com-nodejs/

It works fine, however I'm trying to get different elements from the same webpage: https://ge.globo.com/futebol/brasileirao-serie-a/

With the class used in the tutorial, the selector returns all the matching elements, but with other classes it returns nothing:


All fifty elements with the class ranking-item-wrapper are returned, but selecting elements with the class lista-jogos__jogo returns nothing.


I don't get why this is happening, since I'm doing exactly the same thing as the tutorial.

Here is a short version of the code:

const axios = require('axios');
const cheerio = require('cheerio');
const url = 'https://ge.globo.com/futebol/brasileirao-serie-a/';

axios(url).then(response => {
  const html = response.data;
  const $ = cheerio.load(html);
  console.log($('.ranking-item-wrapper')) // => tutorial class
  console.log('***')
  console.log($('.lista-jogos__jogo')) // => class that I'm using
}).catch(console.error);

Solution

  • I saw the answer from @Bradley, and although it explains what is happening, it doesn't provide a solution. He is correct that the elements are appended with JavaScript, so they are not in the HTML that axios downloads and Cheerio parses. There are a few ways to handle this and still get the same data.

    I saw your response about waiting for the elements to load. This is possible with something like JSDOM or Puppeteer, but it's complete overkill here. It would most likely introduce bugs from unsupported JS and add massive CPU and memory overhead compared to something like Cheerio.

    Typically, in my experience, elements are appended with JavaScript because the data is pulled from an external API. This is a simple fix: check the network tab of the browser's developer tools for the XHR request that fetches the data. The response is usually in an easier-to-parse format too (JSON), because it comes from an API. This is very common nowadays with client-side progressive web apps.

    The alternative is that the data is hard-coded in a script on the page, which can be extracted into a parsable format. You might see this on progressive web apps that use server-side rendering for the SEO benefit.
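
    As a rough sketch of that second case (it is not what this page needs), the helper below pulls a JSON object out of an inline script tag using a regular expression. The function name extractEmbeddedJson and the variable name window.__DATA__ are made up for illustration, and the sketch assumes the assignment is the last statement in its script tag; check the real page source before relying on anything like this.

    ```javascript
    // Hypothetical helper: extracts a JSON object assigned to a global variable
    // inside an inline <script> tag, e.g. <script>window.__DATA__ = {...};</script>.
    // The variable name is an assumption; look at the page source for the real one.
    function extractEmbeddedJson(html, variableName) {
      // Escape regex metacharacters in the variable name (e.g. the dot).
      const escaped = variableName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
      // Capture from the first "{" after the assignment up to the "}" that is
      // directly followed by an optional ";" and the closing </script> tag.
      const pattern = new RegExp(
        escaped + '\\s*=\\s*(\\{[\\s\\S]*?\\})\\s*;?\\s*</script>'
      );
      const match = html.match(pattern);
      if (!match) return null; // variable not found in the HTML
      return JSON.parse(match[1]); // throws if the captured text is not valid JSON
    }
    ```

    With the question's code, this would be called as extractEmbeddedJson(response.data, 'window.__DATA__') in place of cheerio.load, where the variable name is whatever the page actually uses.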

    In this case, I found that the data comes from an external API that returns JSON. The URL is:

    https://api.globoesporte.globo.com/tabela/d1a37fa4-e948-43a6-ba53-ab24ab3a45b1/fase/fase-unica-campeonato-brasileiro-2021/rodada/38/jogos/

    You will need to request this URL instead and parse the JSON response to get the data that you need, rather than using Cheerio.
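
A minimal sketch of that request, using the same axios library as the question. The helper name fetchGames and the injectable http parameter are my own additions for illustration. I have not documented the shape of the JSON here, so log the result and inspect it before picking out fields.

```javascript
// URL taken from the answer above.
const GAMES_URL =
  'https://api.globoesporte.globo.com/tabela/d1a37fa4-e948-43a6-ba53-ab24ab3a45b1/fase/fase-unica-campeonato-brasileiro-2021/rodada/38/jogos/';

// fetchGames is a hypothetical helper name. The HTTP client is a parameter
// (defaulting to axios) so the function can be tried out with a stub.
async function fetchGames(url = GAMES_URL, http = require('axios')) {
  const response = await http.get(url);
  // axios parses JSON responses by default, so response.data is already a
  // JavaScript value rather than a string. No Cheerio is needed.
  return response.data;
}

// Usage (performs a real network request):
// fetchGames().then(games => console.log(games)).catch(console.error);
```

Because the function simply returns the parsed body, whatever filtering or mapping you need can be done on the result with ordinary array and object operations once you know its structure.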