Tags: node.js, express, web-scraping, axios, cheerio

How to scrape from 2 divs that are on the same level with Cheerio


I'm trying to scrape content from 2 different divs that are on the same level. I'm using Node.js, Axios, Cheerio and Express.

Basically, I'm trying to collect an image and the info related to it, but they are placed in different divs that are on the same level. Selecting the parent "main" div doesn't seem to work in my case.

<div class="main">
    <div class="one">
        <!-- image -->
    </div>
    <div class="two">
        <!-- info -->
    </div>
</div>

Below is my code to get the data from a website:

var leafletList = $('.store-flyer__info', html).each(function() {
    let leaflet = {
        title: $(this).find('h3').text(),
        image: $(this).find('source').attr('srcset'),
        link: $(this).find('a').attr('href'),
        validDate: $(this).find('small').text().slice(3,-1)
    }

    leaflets.push(leaflet)
})

Below is the website's HTML:

[image: website's HTML]

As written, my code obviously gets only the title, link and validDate. Does anyone know how I can get the srcset from the other div? I've also tried the following, but it doesn't work:

var leafletList = $('.store-flyers', html).each(function() {
    let leaflet = {
        title: $(this).find('.store-flyer__info h3').text(),
        image: $(this).find('.store-flyer__front source').attr('srcset'),
        link: $(this).find('.store-flyer__info a').attr('href'),
        validDate: $(this).find('.store-flyer__info small').text().slice(3,-1)
    }

    leaflets.push(leaflet)
})

Solution

  • There are many ways to get the result based on the HTML snippet you show, with the caveat that the developer tools can be misleading: they show elements created after page load by JavaScript, which you won't have if you're only requesting the raw page HTML.

    With that in mind, here are a few options:

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    const html = `
    <div class="store-flyer">
      <picture>
        <source srcset="foo.jpeg" type="image/webp">
        <source srcset="bar.jpeg" type="image/jpeg">
      </picture>
    </div>
    <div class="store-flyer">
      <picture>
        <source srcset="quux.jpeg" type="image/webp">
        <source srcset="garply.jpeg" type="image/jpeg">
      </picture>
    </div>
    `;
    const $ = cheerio.load(html);
    const result = [...$(".store-flyer")].map(e => ({
      // select using `.first()` and `.last()` Cheerio methods:
      firstImage: $(e).find("source").first().attr("srcset"),
      secondImage: $(e).find("source").last().attr("srcset"),
    
      // select using CSS attribute selectors:
      firstImageByType: $(e).find('source[type="image/webp"]').attr("srcset"),
      secondImageByType: $(e).find('source[type="image/jpeg"]').attr("srcset"),
    
      // select as an array of all <source> elements:
      allImages: [...$(e).find("source")].map(e => $(e).attr("srcset")),
    }));
    console.log(result);
    

    Output:

    [
      {
        firstImage: 'foo.jpeg',
        secondImage: 'bar.jpeg',
        firstImageByType: 'foo.jpeg',
        secondImageByType: 'bar.jpeg',
        allImages: [ 'foo.jpeg', 'bar.jpeg' ]
      },
      {
        firstImage: 'quux.jpeg',
        secondImage: 'garply.jpeg',
        firstImageByType: 'quux.jpeg',
        secondImageByType: 'garply.jpeg',
        allImages: [ 'quux.jpeg', 'garply.jpeg' ]
      }
    ]
    

    Prepending .store-flyer__front to your source selectors might be a good idea if you need to disambiguate.