javascript node.js web-scraping screen-scraping puppeteer

Scraping: Only the first images are scraped, the rest are filled with a placeholder. Why?


I am scraping a job site using JavaScript and Puppeteer, which drives a headless browser.

I can grab the first 6 company logos from the job site successfully. After those first 6, however, the scraper stops returning the real logo src URLs and returns a placeholder image instead.

What could be the reason for this?

Just FYI, I am grabbing the images like this:

const image = card.querySelector('div.job-element__logo img').src

Solution

  • The images are being lazy loaded.

    The correct src of each image that has not been loaded yet is stored in a data attribute called data-src. You can use page.evaluate() together with Array.from() to collect the correct src value for every image, reading data-src where it exists and falling back to src otherwise:

    const images = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.job-element__logo img'), e => e.dataset.src ? `https://www.stepstone.de${e.dataset.src}` : e.src);
    });
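
The snippet above builds the URL by string concatenation, which only works while data-src holds a path relative to the site root. As a minimal sketch, assuming a hypothetical helper named resolveImageSrc (not part of Puppeteer or the site), the standard URL constructor can handle both relative and absolute values. The example paths in the comments are made up for illustration:

```javascript
// Hypothetical helper: prefer the lazy-load URL in data-src, else the current src.
// `img` only needs `dataset` and `src`, so a plain object works outside a browser.
function resolveImageSrc(img, baseUrl) {
  const raw = (img.dataset && img.dataset.src) || img.src;
  if (!raw) {
    return null; // no usable source on this element
  }
  // new URL(input, base) resolves relative paths against the base
  // and leaves absolute URLs unchanged.
  return new URL(raw, baseUrl).href;
}

// Illustrative call with an invented relative path:
// resolveImageSrc({ dataset: { src: '/logos/a.png' }, src: '' }, 'https://www.stepstone.de/')
```

Because the callback passed to page.evaluate() runs in the page context, a helper like this would have to be defined inside that callback, with location.href as a plausible choice for baseUrl.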
    

    If you would like to scrape the position, company, description, and image for each job, you can use the following solution:

    const jobs = await page.evaluate(() => {
      return Array.from(document.querySelectorAll('.job-element'), card => {
        const position = card.querySelector('.job-element__body__title').textContent.trim();
        const company = card.querySelector('.job-element__body__company').textContent.trim();
        const description = card.querySelector('.job-element__body__details').textContent.trim();
        const imageElement = card.querySelector('.job-element__logo img');
        const image = imageElement.dataset.src ? `https://www.stepstone.de${imageElement.dataset.src}` : imageElement.src;
    
        return {
          position,
          company,
          description,
          image,
        };
      });
    });
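
One caveat with the snippet above: querySelector() returns null when a card has no matching element (a listing without a logo, for example), and reading a property of null throws, which aborts the whole page.evaluate() call. A defensive sketch, assuming a hypothetical helper named textOf, might look like this:

```javascript
// Hypothetical helper: trimmed text of the first element matching `selector`
// inside `root`, or null when nothing matches.
function textOf(root, selector) {
  const el = root.querySelector(selector);
  return el ? el.textContent.trim() : null;
}
```

Inside the page.evaluate() callback, a line such as `const position = textOf(card, '.job-element__body__title');` would then yield null for incomplete cards instead of throwing. The image lookup would need the same kind of null check before reading dataset.src.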