javascript, web-scraping, cheerio

Cheerio JS: return all image SRCs of parent div


Consider the following HTML:

<div aria-roledescription="carousel" data-disliderguid="slider772" class="di-slider slider772-slider gmus-1800x760-slider">
<div class="swiper-container">
<div class="swiper-wrapper">
<div
   class="di-slide swiper-slide"
   data-guid="slide2221"
   data-screen="desktop"
   data-title="995_2024_All_Hummer_Evergreen_2024_DWC"
   data-id="2221"
   data-filtervalue=""
   data-swiper-autoplay="3000">
   
   <div class="di-slider-disclaimer">
      <button class="di-slider-disclaimer-toggle" aria-expanded="false">
      <span class="inactive-label">Important Information</span>
      <span class="active-label">Hide Information</span>
      </button>
      <div class="di-slider-disclaimer-container">
         <div class="di-slider-disclaimer-contents">
            Preproduction and simulated models shown throughout. Actual production model may vary. HUMMER EV is available from a GMC EV dealer.                                
         </div>
      </div>
   </div>
   <a class="di-slider-link"
      aria-hidden="true"
      href="/new-vehicles/?_dFR%5Byear%5D%5B0%5D=2024&_dFR%5Bmake%5D%5B0%5D=GMC&_dFR%5Bmodel%5D%5B0%5D=HUMMER+EV&_dFR%5Bmodel%5D%5B1%5D=HUMMER+EV+SUV&_dFR%5Bmodel%5D%5B2%5D=HUMMER+EV+Pickup"
      title=""
      tabindex="-1"
      >
      <picture class="slide-image">
         <source media="(max-width: 767px)"                                     srcset="https://gtmassets.dealerinspire.com/9061-995_2024_All_Hummer_Evergreen_2024_DWC_600x400.jpg">
         <source media="(min-width: 768px)"
            srcset="https://gtmassets.dealerinspire.com/9061-995_2024_All_Hummer_Evergreen_2024_DWC_1800x760.jpg">
         <img src="https://gtmassets.dealerinspire.com/9061-995_2024_All_Hummer_Evergreen_2024_DWC_1800x760.jpg"                                      alt="GMC HUMMER EV PICKUP AND SUV"
            style=""
            width="1800" height="760">
      </picture>
   </a>
</div>
<div
   class="di-slide swiper-slide"
   data-guid="slide950"
   data-screen="desktop"
   data-title="Generic"
   data-id="950"
   data-filtervalue=""
   >
<picture class="slide-image">
   <source media="(max-width: 767px)"                                     srcset="https://di-uploads-development.dealerinspire.com/robertsonsgmc-winback0123/uploads/2023/03/Group-of-2023-GMC-Terrain-SUVs-parked-on-beach_mobile.jpg">
   <source media="(min-width: 768px)"
      srcset="https://di-uploads-development.dealerinspire.com/robertsonsgmc-winback0123/uploads/2023/03/Group-of-2023-GMC-Terrain-SUVs-parked-on-beach-1800x760.jpg">
   <img src="https://di-uploads-development.dealerinspire.com/robertsonsgmc-winback0123/uploads/2023/03/Group-of-2023-GMC-Terrain-SUVs-parked-on-beach-1800x760.jpg"                                      alt="Group of 2023 GMC Terrain SUVs parked on beach"
      style="visibility:hidden"
      width="1800" height="760">
</picture>

I am trying to use Cheerio via ScrapeNinja to return the src of all images that are descendants of the div with class di-slider, as seen in the first line of the HTML snippet. Every image sits inside an HTML picture element, and the slides all share similar div classes. However, the only value I want returned for each image is its src.

When I try to run the following code on their sandbox: https://scrapeninja.net/cheerio-sandbox/basic, I get an error "Error: Expected name, found ://gtmassets.dealerinspire.com/9061-995_2024_All_Hummer_Evergreen_2024_DWC_1800x760.jpg on line 19"

Here is the code I'm running:

// define function which accepts body and cheerio as args
function extract(input, cheerio) {
    // return object with extracted values              
    let $ = cheerio.load(input);
    var listItems = $(".di-slider");
    listItems.each(function(idx, picture) {
    let image= $(picture).find('img').attr('src'); 
    return {
        source: $(image)
    };
});
    
}

I admit I am not the greatest with JS: I haven't used jQuery in years, and this is my first time trying to use Cheerio or ScrapeNinja.

I have reviewed the documentation at https://pixeljets.com/blog/cheerio-sandbox-cheatsheet/#iterate-over-children-and-return-them-as-an-array-of-objects, and I based my function on the question "How to get image url by cheerio?".


Solution

  • A few issues:

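    1. The main reason for the crash is that you're passing a plain string to Cheerio's $() function: $("https://gtmassets.dealerinspire.com..."). Cheerio tries to parse the URL as a CSS selector, which produces the "Expected name" error. Remove the $() here and use the string directly.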
    2. .forEach/.each doesn't collect the values your callback returns. Array's .forEach ignores them entirely, and Cheerio's .each only checks for false, which stops the loop early. .map, on the other hand, builds an array from all of the values returned by your callback, so it is the best function for the job. You could also push each item onto an array variable, but that is exactly the pattern the map abstraction exists to replace.
    3. You need to return something from extract(). Formatting your code may make it easier to notice this.
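
Point 2 can be seen without Cheerio at all. The following is a minimal sketch on a plain array; the `slides` data and the variable names are hypothetical stand-ins for the matched elements, not part of the original code:

```javascript
// Hypothetical sample data standing in for the matched elements.
const slides = [
  { src: "a.jpg" },
  { src: "b.jpg" },
];

// forEach runs the callback for its side effects only; the objects the
// callback returns are discarded and the call evaluates to undefined.
const fromForEach = slides.forEach(slide => ({ source: slide.src }));

// map collects each callback return value into a new array.
const fromMap = slides.map(slide => ({ source: slide.src }));

console.log(fromForEach); // undefined
console.log(fromMap);     // [ { source: 'a.jpg' }, { source: 'b.jpg' } ]
```

The same reasoning applies to the original function: the `return` inside the `.each` callback never reaches the caller of `extract()`.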

    Working code:

    function extract(input, cheerio) {
      const $ = cheerio.load(input);
      return [...$(".di-slider")].map(e => ({
        source: $(e).find("img").attr("src")
      }));
    }
    

    To get all of the src attributes for all images inside each slider, you can use a nested map:

    function extract(input, cheerio) {
      const $ = cheerio.load(input);
      return [...$(".di-slider")].map(e => ({
        sources: [...$(e).find("img")].map(img => $(img).attr("src"))
      }));
    }
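
If a single flat list of URLs is more convenient than one object per slider, the nested result can be flattened afterwards. This is only a sketch: `result` is a hypothetical value shaped like the output of the nested-map version, with shortened placeholder URLs rather than the real src values.

```javascript
// Hypothetical value shaped like the nested-map result: one object per
// slider, each holding an array of image URLs (placeholders here).
const result = [
  { sources: ["https://example.com/a.jpg", "https://example.com/b.jpg"] },
  { sources: ["https://example.com/c.jpg"] },
];

// flatMap maps each slider to its sources array and concatenates the
// arrays into one flat list.
const allSources = result.flatMap(slider => slider.sources);

console.log(allSources);
```

Whether the grouped or the flat shape is better depends on whether the caller needs to know which slider each image came from.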