node.js, web-scraping, cheerio

Cannot find the img tag while web scraping with Cheerio


I was practising web scraping in Node.js with Cheerio when I encountered an issue.

Link to the website: https://www.croma.com/oneplus-android-phones/bc/b-0948-95 (unfortunately I do not have enough reputation to post images, so I have to provide the link instead)

My objective is to extract the link to the mobile phone image from the img tag. That img tag is a child of an a tag, which in turn is a child of the div element with the class plp-card-thumbnail (Chrome DevTools will help you locate it). I get no results, though: I can locate the a tag, but it does not appear to have any children, even though they are present on the real page.

Here is the code I wrote:

const url = "https://www.croma.com/oneplus-android-phones/bc/b-0948-95";

import $ from "cheerio";
import rp from "request-promise";

rp(url)
  .then((html) => {
    const images = $(".plp-card-thumbnail a", html);
    console.log(images[0]);

    console.log("\n=============================\n");

    console.log($(images).find("img"));
  })
  .catch((err) => {
    console.error(err);
  });

And here is the cropped console output:

.
.
  prev: null,
  next: null,
  startIndex: null,
  endIndex: null,
  children: [], <----
  name: 'a',
  attribs: [Object: null prototype] {
    href: '/oneplus-10-pro-5g-12gb-ram-256gb-emerald-forest-/p/250716'
  },
  type: 'tag',
  namespace: 'http://www.w3.org/1999/xhtml',
  'x-attribsNamespace': [Object: null prototype] { href: undefined },
  'x-attribsPrefix': [Object: null prototype] { href: undefined }
}

=============================

LoadedCheerio {
  length: 0, <----
  options: { xml: false, decodeEntities: true },
.
.

As you can see, no img tag is detected, so I have no way to extract its source. I'd like to know why this is happening and whether there is a solution. Thank you!


Solution

  • This website inserts the <img> tags into each <a> tag dynamically, using JavaScript that runs in the browser. Cheerio only parses the static HTML returned by the server and does not execute scripts, so it never sees those tags. You'll need a headless browser instead (Puppeteer is the go-to example). Here's some working code that gets the URL (data-src attribute) for all your <img> tags.

    const puppeteer = require("puppeteer");
    
    async function getProducts() {
      const browser = await puppeteer.launch({
        headless: false, // shows the browser window; set to true to run headless
        args: ["--no-sandbox"],
      });
      console.log("Browser launched");
      const page = await browser.newPage();
      console.log("New page opened");
      await page.goto("https://www.croma.com/oneplus-android-phones/bc/b-0948-95", { waitUntil: "domcontentloaded" });
      console.log("URL visited");
    
      // Wait for the dynamically inserted <img> tags, not just their container,
      // which is already present in the static HTML
      await page.waitForSelector(".plp-card-thumbnail img");
      console.log("Selector found");
    
      const images = await page.evaluate(() => {
        const thumbnails = document.querySelectorAll('.plp-card-thumbnail');
        const imgSrcs = [];
      
        thumbnails.forEach(thumbnail => {
          const imgs = thumbnail.querySelectorAll('img');
          imgs.forEach(img => {
            imgSrcs.push(img.getAttribute('data-src'));
          });
        });
      
        return imgSrcs;
      });
    
      console.log("Here are your images: ");
      console.log(images);
    
      await browser.close();
    }
    
    getProducts().catch((error) => {
      console.error(error);
    });
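
To check for yourself whether a page builds its markup client-side, you can count the tags in the raw server response, before any JavaScript has run. The following is a minimal sketch, not part of the original answer: countOpeningTags is a hypothetical helper, the two HTML strings are made-up samples modelled on the console output in the question, and the regular expression is only a rough heuristic (it assumes a plain alphanumeric tag name and is not a real HTML parser).

```javascript
// Hypothetical helper: count opening tags with the given name in an HTML string.
// Heuristic only: it matches "<tag" followed by whitespace, "/" or ">".
function countOpeningTags(html, tagName) {
  const pattern = new RegExp("<" + tagName + "(?=[\\s/>])", "gi");
  const matches = html.match(pattern);
  return matches === null ? 0 : matches.length;
}

// Made-up sample of what the server sends: the <a> tag is empty.
const staticHtml =
  '<div class="plp-card-thumbnail">' +
  '<a href="/oneplus-10-pro-5g-12gb-ram-256gb-emerald-forest-/p/250716"></a>' +
  "</div>";

// Made-up sample of the DOM after the page's scripts have run.
const renderedHtml =
  '<div class="plp-card-thumbnail">' +
  '<a href="/oneplus-10-pro-5g-12gb-ram-256gb-emerald-forest-/p/250716">' +
  '<img data-src="https://example.com/phone.png">' +
  "</a></div>";

console.log(countOpeningTags(staticHtml, "img"));   // 0
console.log(countOpeningTags(renderedHtml, "img")); // 1
```

On Node 18 or later you could pass the real response to the same helper, for example the text obtained from fetch(url).then((res) => res.text()). A count of zero for img there, while DevTools shows images on the page, suggests the images are added by scripts and that a parser such as Cheerio will not see them.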
    
