Tags: javascript, node.js, typescript, web-scraping, puppeteer

Cannot use external function inside page.evaluate()


I am scraping a dynamic website with Puppeteer. My goal is to make the scraping logic as generic as possible and to remove a lot of boilerplate code. To that end, I created an external function that scrapes the data, given certain parameters. The problem is that when I tried to use that function inside Puppeteer's page.evaluate() method, I got a ReferenceError saying that the function is not defined.

I did some research, and page.exposeFunction() and page.addScriptTag() came up as possible solutions. However, when I tried them in my scraper, addScriptTag() didn't work, and exposeFunction() didn't let me access DOM elements inside the exposed function. I understand that exposeFunction() executes in Node.js while addScriptTag() executes in the browser, but I don't know how to proceed with that information, or whether it is even relevant to my case.

Here is my scraper:

import { Browser } from "puppeteer";

import { dataMapper } from "../../utils/api/functions/data-mapper.js";

export const mainCategoryScraper = async (browser: Browser) => {
  const [page] = await browser.pages();

  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
  );

  await page.setRequestInterception(true);

  page.on("request", (req) => {
    if (
      req.resourceType() === "stylesheet" ||
      req.resourceType() === "font" ||
      req.resourceType() === "image"
    ) {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.goto("https://www.ozone.bg/pazeli-2d-3d/nastolni-igri", {
    waitUntil: "domcontentloaded",
  });

  /**
   * Function will execute in Node.js
   */
  // await page.exposeFunction('dataMapper', dataMapper);

  /**
   * Injects the function into the page, so it executes in the browser and can receive DOM elements
   */
  // await page.addScriptTag({ content: `${dataMapper}` });

  const data = await page.evaluate(async () => {
    const contentContainer = document.querySelector(".col-main") as HTMLDivElement;

    const carousels = Array.from(
      contentContainer.querySelectorAll(".owl-item") as NodeListOf<HTMLDivElement>
    );

    const carouselsData = await dataMapper<HTMLDivElement>(carousels, ".title", "img", "a");

    return {
      carouselsData,
    };
  });
  await browser.close();

  return data;
};

And here is the dataMapper function:

import { PossibleTags } from "../typescript/types.js";

export const dataMapper = function <T extends HTMLDivElement>(items: Array<T>, ...selectors: string[]) {
  let hasTitle = false;

  for (const selector of selectors) {
    if (selector === ".title" || selector === "h3") {
      hasTitle = true;
      break;
    }
  }
  
  return items.map((item) => {
    const data: PossibleTags = {};

    return selectors.map((selector) => {
        
      const dataProp = item.querySelector(selector);

      switch (selector) {
        case ".title": {
          data["title"] = (dataProp as HTMLSpanElement)?.innerText;
          break;
        }
        case "h3": {
          data["title"] = (dataProp as HTMLHeadingElement)?.innerText;
          break;
        }
        case "h6": {
          data["subTitle"] = (dataProp as HTMLHeadingElement)?.innerText;
          break;
        }
        case "img": {
          if (!hasTitle) {
            data["img"] = (dataProp as HTMLImageElement)?.getAttribute("src") ?? undefined;
            break;
          }

          data["title"] = (dataProp as HTMLImageElement)?.getAttribute("alt") ?? undefined;
          break;
        }
        case "a": {
          data["url"] = (dataProp as HTMLAnchorElement)?.getAttribute("href") ?? undefined;
        }
        default: {
          throw new Error("Such selector is not yet added to the possible selectors");
        }
      }
    });
  });
};

When I use page.exposeFunction('dataMapper', dataMapper);, it tells me that item.querySelector is not a function (inside dataMapper). With await page.addScriptTag({ content: `${dataMapper}` });, it throws an error later on, inside page.evaluate, saying that dataMapper is not a function.

Update: when specifying a path inside addScriptTag, it still gives me: Error [ReferenceError]: dataMapper is not defined. Just to mention: mainCategoryScraper is later used in a scrapersHandler function, which decides which scraper to execute based on the URL endpoint.


Solution

  • As discussed in my comment, the approach here seems rather convoluted. I'd caution against premature abstractions.

    In general, once you need to add multiple conditions (switch and if) where there weren't any before, you may be headed down the wrong path. These increase the cognitive complexity of the code. Complexity in a function can be acceptable if it reduces complexity for the caller, but if the contract for the function isn't clear, then the abstraction may leak problems back to the caller.

    Packing all of the logic into your dataMapper function breaches the single responsibility principle and makes it unmaintainable, because you'll need to keep burdening it further with additional types of structures. The control flow within the function is already difficult to grasp and can't be extended in any sensible way. The caller should be responsible for explicitly encoding the structure to be scraped, rather than trying to write an all-in-one function that can't sensibly be written for these structures.

    Another rule of thumb: if the factoring is difficult, then just keep the repetition. Or take a step back and try to write a different abstraction, either at a higher or lower level than the first attempt.

    In this case, you might write a couple of higher-level abstractions $$evalMap and $text, which let you write your data mappers more cleanly. These abstractions just clear some of the syntax out of the way, but don't attempt to generalize scraping different structures with conditions.

    const puppeteer = require("puppeteer"); // ^22.7.1
    
    const url = "https://www.ozone.bg/pazeli-2d-3d/nastolni-igri";
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      const ua =
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
      await page.setUserAgent(ua);
      await page.setRequestInterception(true);
      const blockedResources = [
        "image",
        "fetch",
        "other",
        "ping",
        "stylesheet",
        "xhr",
      ];
      page.on("request", req => {
        if (
          !req.url().startsWith("https://www.ozone.bg") ||
          blockedResources.includes(req.resourceType())
        ) {
          req.abort();
        } else {
          req.continue();
        }
      });
      await page.evaluateOnNewDocument(
        "window.$text = (el, s) => el.querySelector(s)?.textContent.trim();"
      );
      await page.goto(url, {waitUntil: "domcontentloaded"});
    
      const $$evalMap = async (sel, mapFn) => {
        await page.waitForSelector(sel);
        return page.$$eval(
          sel,
          (els, mapFn) => els.map(new Function(`return ${mapFn}`)()),
          mapFn.toString()
        );
      };
    
      const carouselData = await $$evalMap(".owl-item", el => ({
        title: $text(el, ".title"),
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }));
    
      const widgetData = await $$evalMap(".widget-box", el => ({
        title: el.querySelector("img").alt,
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }));
    
      const sliderData = await $$evalMap(
        ".item.slick-slide",
        el => ({
          title: $text(el, "h3"),
          subTitle: $text(el, "h6"),
          img: el.querySelector("img").src,
          url: el.querySelector("a").href,
        })
      );
    
      console.log(carouselData);
      console.log(widgetData);
      console.log(sliderData);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    If TypeScript types are getting in the way, consider moving the querySelectors out to helper functions similar to $text. The $$evalMap calls can also be moved out to individual functions for each section (scrapeCarouselData, scrapeWidgetData, scrapeSliderData, etc). Doing so would require $$evalMap to accept a page parameter. If you're breaking out functions, though, $$evalMap may not be necessary at all, since the complexity is hidden inside each function. Plain $$eval seems perfectly acceptable too, especially if there are only three scrapers.
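
    As a rough sketch of what such helpers might look like, here is $text next to a hypothetical $attr (the name $attr, the stand-in element, and its sample values are all made up for illustration). In a real scraper these would be installed in the page with evaluateOnNewDocument, as $text is above. The fake element only exists so the snippet runs in plain Node without a browser:

```javascript
// $text mirrors the helper used in the scraper; $attr is a hypothetical sibling.
const $text = (el, s) => el.querySelector(s)?.textContent.trim();
const $attr = (el, s, name) =>
  el.querySelector(s)?.getAttribute(name) ?? undefined;

// Minimal stand-in for a DOM element (illustrative data, not from the site).
const fakeEl = {
  querySelector(sel) {
    const children = {
      ".title": {textContent: "  Example title  ", getAttribute: () => null},
      img: {
        textContent: "",
        getAttribute: name => ({src: "/example.jpg"})[name] ?? null,
      },
    };
    return children[sel] ?? null; // like the DOM, null when nothing matches
  },
};

const row = {
  title: $text(fakeEl, ".title"),
  img: $attr(fakeEl, "img", "src"),
  url: $attr(fakeEl, "a", "href"), // the stand-in has no <a> child
};
console.log(row);
```

    With helpers like these, each mapper stays a flat object literal, and a missing element yields undefined instead of throwing.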

    Breaking up this main IIFE into sub-functions would be straightforward:

    const puppeteer = require("puppeteer");
    
    const $$evalMap = async (page, sel, mapFn) => {
      await page.waitForSelector(sel);
      return page.$$eval(
        sel,
        (els, mapFn) => els.map(new Function(`return ${mapFn}`)()),
        mapFn.toString()
      );
    };
    
    const scrapeCarouselData = page =>
      $$evalMap(page, ".owl-item", el => ({
        title: $text(el, ".title"),
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }));
    
    const scrapeWidgetData = page =>
      $$evalMap(page, ".widget-box", el => ({
        title: el.querySelector("img").alt,
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }));
    
    const scrapeSliderData = page =>
      $$evalMap(page, ".item.slick-slide", el => ({
        title: $text(el, "h3"),
        subTitle: $text(el, "h6"),
        img: el.querySelector("img").src,
        url: el.querySelector("a").href,
      }));
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      // ... same optimization code, could be moved out to a setup function
      const url = "https://www.ozone.bg/pazeli-2d-3d/nastolni-igri";
      await page.evaluateOnNewDocument(
        "window.$text = (el, s) => el.querySelector(s)?.textContent.trim();"
      );
      await page.goto(url, {waitUntil: "domcontentloaded"});
      console.log(await scrapeCarouselData(page));
      console.log(await scrapeWidgetData(page));
      console.log(await scrapeSliderData(page));
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    $$evalMap is a little ugly but can be moved out to an imported utility function that lives in another file. The scrapers rely on $text existing, so some shared/global setup could be the place for this to live, potentially along with the optimization code in the first snippet.
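
    The ugliness comes from the fact that a function can't be handed to the browser as a live object. $$evalMap sends its source text and rebuilds it on the other side. The sketch below imitates that round trip in plain Node, with no browser involved (revive and makeMappers are hypothetical names for this illustration). It also suggests why the question's page.evaluate call hit a ReferenceError: page.evaluate stringifies its callback in a similar way, so anything the callback closes over is left behind.

```javascript
// Rebuild a function from its source text, as the $$eval callback does.
const revive = src => new Function(`return ${src}`)();

function makeMappers() {
  const labelPrefix = "item-"; // local scope, comparable to a Node-side import
  return {
    selfContained: n => n * 2,
    usesClosure: n => labelPrefix + n,
  };
}
const {selfContained, usesClosure} = makeMappers();

// A function that only uses its own arguments survives the round trip.
const doubled = [1, 2, 3].map(revive(selfContained.toString()));
console.log(doubled); // [ 2, 4, 6 ]

// A function that refers to an outer variable does not: the revived copy
// resolves the name in the global scope, where it doesn't exist.
let closureError;
try {
  revive(usesClosure.toString())(1);
} catch (err) {
  closureError = err;
}
console.log(closureError.name); // ReferenceError
```

    This is why the mappers passed to $$evalMap only touch their el argument and $text, which was installed in the page itself.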

    Now the main code is quite clean, and each scraping function is easy to maintain.

    Writing similar scraping functions would follow the established pattern. If a particular section doesn't adhere well to the $$evalMap paradigm, no problem--it should use its own bespoke logic rather than trying to cram it into one of the existing functions with a condition.

    Summary and further remarks:

    • Avoid premature abstractions.
    • When factoring, stop if you're introducing multiple conditions where there weren't any before. switch/case/break is particularly nasty. If your abstractions are more verbose and hard to understand than the original repeated code, don't do them, or try to find a different abstraction.
    • When factoring, write it the verbose way first, then try to abstract away the similarities (but a bit of repetition is acceptable--be honest about what's easier to read and maintain).
    • Type assertions with as are discouraged in TS. Use them as little as possible in favor of typed variables.
    • Use $$eval all the time. It's the most generally useful scraping function in Puppeteer, avoiding an ugly Array.from(document.querySelectorAll) or element handles. If Puppeteer's locators API matures in the future, $$eval may be supplanted, but for now it's the way to go.
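    • ?? undefined is largely unnecessary. If the optional chaining operator ?. short-circuits, the expression already evaluates to undefined, so undefined ?? undefined is pointless. The only case it changes is getAttribute returning null for a missing attribute, which it converts to undefined.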
    • Generally speaking, you don't need addScriptTag or exposeFunction. If you're writing $text-like abstractions often, you can add jQuery or something like that to simplify querying.
    • In web scraping, there are no silver bullets for selection, so be very cautious when attempting to generalize--it's not impossible, but requires a good deal of care and case-specific planning.
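
    For completeness, here is one likely reason the addScriptTag({ content: `${dataMapper}` }) attempt from the question didn't define anything. Interpolating a function calls toString(), and for a function expression that yields anonymous source text, which is not a valid standalone script. The sketch below reproduces this with Node's vm module rather than a real page, and the dataMapper here is a simplified stand-in, so treat it as an approximation of what the browser does:

```javascript
const vm = require("node:vm");

// Simplified stand-in for the imported mapper; the body is irrelevant here.
const dataMapper = function (items, ...selectors) {
  return items.map(() => selectors.length);
};

// The interpolated source is an anonymous function expression.
// Run as a script on its own, it fails to parse.
let bareError;
try {
  vm.runInNewContext(`${dataMapper}`, {});
} catch (err) {
  bareError = err;
}
console.log(bareError.name); // SyntaxError

// Binding the expression to a name makes the script valid and leaves a
// callable dataMapper behind in that context.
const sandbox = {};
vm.runInNewContext(`var dataMapper = ${dataMapper}`, sandbox);
const result = sandbox.dataMapper(["a", "b"], ".title", "img");
console.log(result[0], result[1]); // 2 2
```

    In Puppeteer, the analogue would presumably be content along the lines of `window.dataMapper = ${dataMapper}`, matching how $text is installed above. That is untested here, and it only works if the compiled function is self-contained.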