Tags: javascript, google-apps-script, web-scraping, cheerio

Scraping data from a table with a specific title and filtering out specific rows (Google Apps Script)


Documentation for CheerioGS:
https://github.com/tani/cheeriogs

The idea is to collect data only from the table titled Argentinos Jrs, skipping any rows whose info column says Away on International duty.

Note: I really need to select the table by the value Argentinos Jrs and exclude rows by the value Away on International duty, because neither the position of this table nor the contents of its rows are fixed.

The result I expect in this example is:

Carlos Quintana      Mid August
Jonathan Sandoval    Early August

The website link is this:
https://www.sportsgambler.com/injuries/football/argentina-superliga/

Here is a screenshot of the site as it looks now, so that the idea of my example stays on record even if the data changes:
(screenshot of the site)

The code I tried:

function PaginaDoJogo() {
    var sheet = SpreadsheetApp.getActive().getSheetByName('Dados Importados');
    var url = 'https://www.sportsgambler.com/injuries/football/argentina-superliga/';

    const contentText = UrlFetchApp.fetch(url).getContentText();
    const $ = Cheerio.load(contentText);

    $('div:contains("Argentinos Jrs") > div > div.inj-container:not(contains("Away on International duty")) > span.inj-player')
        .each((index, element) => {
            sheet.getRange(index + 2, 1).setValue($(element).text());
        });

    $('div:contains("Argentinos Jrs") > div > div.inj-container:not(contains("Away on International duty")) > span.inj-return.h-sm')
        .each((index, element) => {
            sheet.getRange(index + 2, 2).setValue($(element).text());
        });
}

Solution

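function PaginaDoJogo() {
  const sheet = SpreadsheetApp.getActive().getSheetByName('Dados Importados');
  const url = 'https://www.sportsgambler.com/injuries/football/argentina-superliga/';
  const response = UrlFetchApp.fetch(url);
  const content = response.getContentText();
  // Isolate the Argentinos Jrs block: from the team name up to the next
  // "Livestream call to action" comment in the page source.
  const match = content.match(/Argentinos Jrs[\s\S]+?<!--Livestream call to action-->/);
  if (!match) return; // team section not found, nothing to write
  // Groups: 1 = player, 2 = info, 3 = expected return.
  const regExp = /<div[\s\S]+?<span class="inj-player">(.+?)<\/span>[\s\S]+?<span class="inj-info">(.+?)<\/span>[\s\S]+?<span class="inj-return h-sm">(.+?)<\/span>[\s\S]+?<\/div>/g;
  const values = [];
  let r; // declared locally so it does not leak as an implicit global
  while ((r = regExp.exec(match[0])) !== null) {
    // console.log(r[1], r[2], r[3]);
    // Skip the header row and players away on international duty.
    if (r[1] !== 'Name' && r[2] !== 'Away on International duty') {
      values.push([r[1], r[3]]);
    }
  }
  // getRange throws when asked for zero rows, so write only if something matched.
  if (values.length > 0) {
    sheet.getRange(2, 1, values.length, 2).setValues(values);
  }
}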
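
To try the filtering logic without hitting the site or a spreadsheet, the parsing step can be pulled out into a plain function. The following is only a sketch under stated assumptions: `extractInjuries` and `sampleHtml` are hypothetical names, the sample markup is invented for illustration (the real page surely contains more), and only the span class names and the end-of-section comment are taken from the regular expressions above.

```javascript
// Hypothetical helper (not part of the original answer): returns
// [player, expectedReturn] pairs for one team from raw HTML.
// Assumption: rows use the span classes seen in the answer's regex.
function extractInjuries(html, team, excludedInfo, sectionEnd) {
  const start = html.indexOf(team);
  if (start === -1) return []; // team not on the page
  const end = html.indexOf(sectionEnd, start);
  // Keep only the text between the team name and the end marker.
  const section = end === -1 ? html.slice(start) : html.slice(start, end);

  const rowPattern = /<span class="inj-player">(.+?)<\/span>[\s\S]*?<span class="inj-info">(.+?)<\/span>[\s\S]*?<span class="inj-return h-sm">(.+?)<\/span>/g;
  const rows = [];
  let m;
  while ((m = rowPattern.exec(section)) !== null) {
    // Drop the header row and any row with the excluded info value.
    if (m[1] !== 'Name' && m[2] !== excludedInfo) {
      rows.push([m[1], m[3]]);
    }
  }
  return rows;
}

// Invented sample markup; "Player X", "Player Y", "Player Z" and
// "Injury A" are placeholders, not data from the site.
const sampleHtml = [
  '<h3>Other Team</h3>',
  '<div><span class="inj-player">Player X</span> <span class="inj-info">Injury A</span> <span class="inj-return h-sm">Late August</span></div>',
  '<!--Livestream call to action-->',
  '<h3>Argentinos Jrs</h3>',
  '<div><span class="inj-player">Name</span> <span class="inj-info">Info</span> <span class="inj-return h-sm">Expected Return</span></div>',
  '<div><span class="inj-player">Carlos Quintana</span> <span class="inj-info">Injury A</span> <span class="inj-return h-sm">Mid August</span></div>',
  '<div><span class="inj-player">Player Y</span> <span class="inj-info">Away on International duty</span> <span class="inj-return h-sm">Early August</span></div>',
  '<div><span class="inj-player">Jonathan Sandoval</span> <span class="inj-info">Injury A</span> <span class="inj-return h-sm">Early August</span></div>',
  '<!--Livestream call to action-->',
  '<h3>Next Team</h3>',
  '<div><span class="inj-player">Player Z</span> <span class="inj-info">Injury A</span> <span class="inj-return h-sm">Unknown</span></div>',
].join('\n');

const rows = extractInjuries(
  sampleHtml,
  'Argentinos Jrs',
  'Away on International duty',
  '<!--Livestream call to action-->'
);
console.log(rows);
```

In Apps Script, the returned array could be passed to `setValues` in the same way as `values` in the function above, with the HTML coming from `UrlFetchApp.fetch(url).getContentText()`. Keeping the parsing separate from the sheet access makes it easier to check the filter against saved copies of the page when the site's markup changes.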