
How to label URLs according to Start URLs in Apify


Using Web Scraper you can input multiple Start URLs. You can label them using Glob Patterns or Pseudo-URLs.

I have multiple URLs to crawl, but they can't be distinguished using Glob Patterns or Pseudo-URLs.

The only option I can think of is to split the crawl into multiple tasks.

Is there a better way?


Solution

  • If the links on a page share the same selectors and you want to distinguish them by the heading they appear under, selectors, globs, and Pseudo-URLs alone are usually not enough. You'll need to write your own logic in the page function.

    Let's take an example of docs.apify.com as a page where you might want to do that:

    Screenshot of Apify documentation homepage

    Now let's say you want to scrape the links under Guides and Platform features, but handle them differently. To do that, go through the section boxes one by one, check each box's title, and enqueue its links with a matching label. This code in the pageFunction should work:

    async function pageFunction(context) {
        // Requests enqueued below carry a label; the start page has none.
        const { label } = context.request.userData;
        if (label === "GUIDES") {
            context.log.info(`Handling guide link: ${context.request.url}`);
            // ...
            return;
        }
        if (label === "PLATFORM") {
            context.log.info(`Handling platform feature link: ${context.request.url}`);
            // ...
            return;
        }
    
        // No label: this is the start page, so find the section cards.
        const $ = context.jQuery;
        for (const cardElement of $("[data-test=MenuCard]").toArray()) {
            // Pick the label from the card's heading text.
            const title = $("h2", cardElement).text().trim();
            let label;
            if (title.toLowerCase() === "guides") {
                label = "GUIDES";
            } else if (title.toLowerCase() === "platform features") {
                label = "PLATFORM";
            }
            if (label !== undefined) {
                for (const link of $(".card-body-wrap li a", cardElement)) {
                    await context.enqueueRequest({
                        url: new URL(link.href, context.request.loadedUrl).toString(),
                        userData: {
                            label,
                        }
                    });
                }
            }
        }
    }
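
If more sections need their own label, the if/else chain on the heading can be replaced by a lookup table. The following is only a sketch of that idea, not part of the Web Scraper API: the names `LABELS_BY_HEADING`, `labelForHeading`, and `buildRequests` are hypothetical, and the request shape simply mirrors the object passed to `context.enqueueRequest` in the code above.

```javascript
// Hypothetical lookup table: card heading (lowercased) -> request label.
const LABELS_BY_HEADING = new Map([
    ["guides", "GUIDES"],
    ["platform features", "PLATFORM"],
]);

// Returns the label for a card heading, or undefined if the card is not listed.
function labelForHeading(heading) {
    return LABELS_BY_HEADING.get(heading.trim().toLowerCase());
}

// Builds request objects (url + userData.label) for the links of one card.
// Returns an empty array when the heading has no label in the table.
function buildRequests(heading, hrefs, baseUrl) {
    const label = labelForHeading(heading);
    if (label === undefined) return [];
    return hrefs.map((href) => ({
        // Resolve relative links against the page URL.
        url: new URL(href, baseUrl).toString(),
        userData: { label },
    }));
}
```

Inside the pageFunction, each object returned by `buildRequests` could then be passed to `context.enqueueRequest`, assuming the helpers are defined within the page function so they are available in the browser context.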