google-apps-script, web-scraping, cheerio

Scraping a URL value using contains and Cheeriogs


I use the Cheeriogs library for scraping:

https://github.com/tani/cheeriogs

This is the element whose href value I need to collect:

<a class="tnmscn" itemprop="url" href="/en/predictions-tips-wealdstone-solihull-moors-1455115">

This is the code I'm currently using to extract the value:

const contentText = UrlFetchApp.fetch(url).getContentText();
const $ = Cheerio.load(contentText);

const scrapurl = $('div.schema > div > div.tnms > div > a.tnmscn');
const urlmatch = $(scrapurl).attr('href').trim();
Logger.log(urlmatch);

But I'm afraid it isn't reliable: if the layout of the site changes, this positional selector could collect a different link from the one in the clickable element currently at that position.

To make it more robust, I tried using:

div.schema > div > div.tnms > div > a:contains("/en/predictions-tips")

That didn't work. How should I use contains for this need?

Additional information:

Page Link
https://www.forebet.com/en/teams/wealdstone



Solution

  • In your situation, how about the following selectors?

    From:

    const scrapurl = $('div.schema > div > div.tnms > div > a.tnmscn');
    

    To:

    const scrapurl = $('a.tnmscn[href^="/en/predictions"]');
    

    or

    const scrapurl = $('div.schema > div > div.tnms > div > a.tnmscn[href^="/en/predictions"]');
    

    or

    const scrapurl = $('div.schema > div > div.tnms > div > a[href^="/en/predictions"]');
    
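    • With each of the modified selectors above, /en/predictions-tips-wealdstone-solihull-moors-1455115 is retrieved.
    • These selectors use the attribute selector [href^="/en/predictions"], which matches an a tag (with or without the class tnmscn) whose href value starts with /en/predictions. By contrast, :contains() tests the text content of an element, not its attributes, which is why it did not match the href.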
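
As an illustration of what the ^= operator checks, here is a plain-JavaScript sketch that applies the same "starts with" test to a list of href strings. It does not use Cheerio. The function name hrefsStartingWith and the last two sample hrefs are hypothetical, added only for this example.

```javascript
// Plain-JavaScript analogue of the CSS attribute selector [href^="prefix"]:
// keep only the href values that begin with the given prefix.
function hrefsStartingWith(hrefs, prefix) {
  return hrefs.filter(function (href) {
    return typeof href === 'string' && href.startsWith(prefix);
  });
}

// Sample data: the first two values are the ones reported for the page
// in the question; the last two are hypothetical links for contrast.
const sampleHrefs = [
  '/en/predictions-tips-wealdstone-solihull-moors-1455115',
  '/en/predictions-tips-crawley-town-leyton-orient-1474259',
  '/en/teams/wealdstone',
  '/fr/en/predictions-tips-example'
];

const matched = hrefsStartingWith(sampleHrefs, '/en/predictions');
console.log(matched);
```

The last sample href contains /en/predictions but does not start with it, so a prefix test rejects it, whereas a substring test (the *= operator in CSS) would accept it.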

    However, on the page at your URL, 2 elements match these selectors, as Granitosaurus already mentioned in a comment. Because .attr('href') returns the value of the first matched element only, the modifications above are sufficient when you want just the 1st value.

    If you want to retrieve both values, how about the following modification?

    Modified script:

    The other modified selectors above can also be used in this script.

    const url = "https://www.forebet.com/en/teams/wealdstone";
    const contentText = UrlFetchApp.fetch(url).getContentText();
    const $ = Cheerio.load(contentText);
    const scrapurl = $('div.schema > div > div.tnms > div > a.tnmscn[href^="/en/predictions"]'); // or simply 'a.tnmscn[href^="/en/predictions"]'
    $(scrapurl).each(function() {
      const urlmatch = $(this).attr('href');
      console.log(urlmatch);
    });
    
    • When this script is run, the following result is obtained.

        /en/predictions-tips-wealdstone-solihull-moors-1455115
        /en/predictions-tips-crawley-town-leyton-orient-1474259
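
The retrieved values are root-relative paths. If full URLs are needed, one possible approach is to prepend the site origin, as in the sketch below. The helper name toAbsoluteUrl is hypothetical, and the sketch assumes every href starts with "/" as in the output above. It uses plain string handling rather than the URL class, on the assumption that the URL class may not be available in Apps Script.

```javascript
// Hypothetical helper: join a site origin and a root-relative href.
// Returns null when href is missing, e.g. when .attr('href') found no
// matching element and returned undefined.
function toAbsoluteUrl(origin, href) {
  if (typeof href !== 'string' || href.trim() === '') {
    return null;
  }
  const cleanOrigin = origin.replace(/\/+$/, ''); // drop trailing slashes
  // Assumption: href is root-relative, i.e. it starts with "/".
  return cleanOrigin + href.trim();
}

const siteOrigin = 'https://www.forebet.com';
const relativeHrefs = [
  '/en/predictions-tips-wealdstone-solihull-moors-1455115',
  '/en/predictions-tips-crawley-town-leyton-orient-1474259'
];

const absoluteUrls = relativeHrefs.map(function (href) {
  return toAbsoluteUrl(siteOrigin, href);
});
console.log(absoluteUrls);
```

Returning null for a missing href also avoids the TypeError that .attr('href').trim() throws when the selector matches nothing.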
      
