With Apify/Puppeteer, crawl all URLs except those that contain a word


With Apify/Puppeteer, how can I crawl all pages except those that include a certain word?

Inside handlePageFunction, the original code looks like this:

        await Apify.utils.enqueueLinks({
            requestQueue,
            page,
            pseudoUrls: [
                baseurl + '[.*]',
            ],
        });

This crawls all pages. If I want to skip page URLs that contain "foo", is there any way I could adjust pseudoUrls to do that?


Solution

  • As per Apify documentation for PseudoUrls:

    A PURL is simply a URL with special directives enclosed in [] brackets. Currently, the only supported directive is [RegExp], which defines a JavaScript-style regular expression to match against the URL.

    So you can exclude URLs that contain foo by embedding a regular expression with a negative lookahead at the front of the pseudo-URL, like this:

    await Apify.utils.enqueueLinks({
        // ...
        pseudoUrls: [
            '[(?!.*foo)]' + baseurl + '[.*]',
        ],
    });
    

    What this does:

    • The square brackets [ and ] mark this part of the pseudoUrl as an embedded regex.
    • (?! and ) delimit a negative lookahead group in a regular expression. If the sub-regex inside it matches from the current position, the main (outer) regex refuses to match. Because the lookahead sits at the very start of the pattern, it checks the whole URL.
    • .* means that any characters may precede the string that you want to avoid matching.
    • foo is the string you want to avoid matching.
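
To see the lookahead's effect without running a crawler, here is a minimal sketch in plain JavaScript. It assumes the pseudo-URL ends up as an anchored regular expression with the literal part escaped; that is an approximation for illustration, not Apify's actual internals. The base URL and the shouldEnqueue and escapeRegExp helpers are hypothetical names.

```javascript
// Hypothetical base URL, used only for illustration.
const baseurl = 'https://example.com/';

// Escape regex metacharacters so the literal part of the URL matches verbatim.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Rough stand-in for '[(?!.*foo)]' + baseurl + '[.*]':
// anchored at both ends, with the negative lookahead evaluated at position 0.
const pattern = new RegExp('^(?!.*foo)' + escapeRegExp(baseurl) + '.*$');

// Hypothetical helper: true if the URL would pass the filter.
function shouldEnqueue(url) {
    return pattern.test(url);
}

console.log(shouldEnqueue('https://example.com/about'));     // true
console.log(shouldEnqueue('https://example.com/foo/page'));  // false
console.log(shouldEnqueue('https://example.com/a/b?x=foo')); // false
console.log(shouldEnqueue('https://other.org/about'));       // false
```

Note that the lookahead scans the entire URL, including the query string, so if baseurl itself contains foo, nothing will match.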