Tags: web-scraping, apify

How to scrape dynamic-loading listing and individual pages using Apify?


How do I use the features of Apify to generate a full list of URLs for scraping from an index page in which items are added in sequential batches when the user scrolls toward the bottom? In other words, it's dynamic loading/infinite scroll, not operating on a button click.

Specifically, on this page - https://www.provokemedia.com/agency-playbook - I cannot get it to identify anything other than the initially displayed 13 entries.

These elements appear at the bottom of each segment, with display: none changing to display: block as each new segment is added. No "style" attribute is visible in the raw source; it only appears via the DevTools Inspector.

<div class="text-center" id="loader" style="display: none;">
    <h5>Loading more ...</h5>
</div>
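Because that #loader div is always present in the DOM and only its inline display style toggles, waiting for the selector to merely *exist* can succeed immediately even when nothing is loading. A hedged sketch of the alternative rule - test visibility rather than existence - extracted as a pure predicate so the logic is easy to see (the browser-side calls in the comments are one way to drive it):

```javascript
// Hypothetical visibility check: the element exists at all times, so we
// test its computed display value instead of waiting for the selector.
const isShown = (displayValue) => displayValue !== 'none';

// In the browser this predicate could be fed by, e.g.:
//   isShown(getComputedStyle(document.querySelector('#loader')).display)
// or checked directly with jQuery:
//   $('#loader').is(':visible')
```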

Here is my basic setup for Web Scraper...

Start URLs:

https://www.provokemedia.com/agency-playbook
{
  "label": "START"
}

Link selector:

div.agencies div.column a

Pseudo URLs:

https://www.provokemedia.com/agency-playbook/agency-profile/[.*]
{
  "label": "DETAIL"
}

Page function:

async function pageFunction(context) {
    const { request, log, skipLinks } = context;
    // request: holds info about current page
    // log: logs messages to console
    // skipLinks: don't enqueue matching Pseudo Links on current page
    // >> cf. https://docs.apify.com/tutorials/apify-scrapers/getting-started#new-page-function-boilerplate



    // *********************************************************** //
    //                          START page                         //
    // *********************************************************** //
    if (request.userData.label === 'START') {
        log.info('Store opened!');
        // Do some stuff later.
    }


    // *********************************************************** //
    //                          DETAIL page                        //
    // *********************************************************** //
    if (request.userData.label === 'DETAIL') {
        log.info(`Scraping ${request.url}`);
        await skipLinks();
        // Do some scraping.
        return {
            // Scraped data.
        }
    }
}

Presumably, inside the START branch, I need to reveal the whole list so that more than just the initial 13 items get enqueued.

I have read through Apify's docs, including on "Waiting for dynamic content". await waitFor('#loader'); seemed like a good bet.

I added the following to the START portion...

    let timeoutMillis; // undefined
    const loadingThing = '#loader';
    while (true) {
        log.info('Waiting for the "Loading more" thing.');
        try {
            // Default timeout first time.
            await waitFor(loadingThing, { timeoutMillis });
            // 2 sec timeout after the first.
            timeoutMillis = 2000;
        } catch (err) {
            // Ignore the timeout error.
            log.info('Could not find the "Loading more thing", '
                + 'we\'ve reached the end.');
            break;
        }
        log.info('Going to load more.');
        // Scroll to bottom, to expose more
        // $(loadingThing).click();
        window.scrollTo(0, document.body.scrollHeight);
    }

But it didn't work...

2021-01-08T23:24:11.186Z INFO  Store opened!
2021-01-08T23:24:11.189Z INFO  Waiting for the "Loading more" thing.
2021-01-08T23:24:11.190Z INFO  Could not find the "Loading more thing", we've reached the end.
2021-01-08T23:24:13.393Z INFO  Scraping https://www.provokemedia.com/agency-playbook/agency-profile/gci-health

Unlike other web pages, this page does not scroll to the bottom when I manually enter window.scrollTo(0, document.body.scrollHeight); into the DevTools Console.

However, when executed manually in the Console, this code, which adds a small delay - setTimeout(function(){window.scrollBy(0,document.body.scrollHeight)}, 1); - as found in this question - does jump to the bottom each time...

If I replace the last line of the while loop above with that line, however, the loop still logs that it could not find the element.
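A likely reason the setTimeout variant changes nothing: setTimeout schedules its callback and returns immediately, so the loop reaches the next waitFor before the scroll has actually happened. A minimal sketch of an awaitable delay (assuming plain Promises work inside the page function, which runs in the browser context):

```javascript
// setTimeout fires its callback later and returns at once, so a loop
// using it directly carries on before the scroll occurs. Awaiting a
// promisified delay keeps the scroll and the subsequent waitFor in order.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Inside the while loop, instead of window.scrollTo(...):
//   await sleep(1);
//   window.scrollBy(0, document.body.scrollHeight);
```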

Am I misusing these methods? I'm not sure which way to turn.


Solution

  • @LukášKřivka's answer at How to make the Apify Crawler to scroll full page when web page have infinite scrolling? provides the framework for my answer...

    Summary:

    • Create a function that forces scrolling to the bottom of the page
    • Get all elements

    Detail:

    • In a while loop, scroll to the bottom of the page.
    • Wait, e.g. 5 seconds, for new content to render.
    • Keep a running count of the number of target-link selectors, for logging.
    • Repeat until no more items load.

    Call this function only when pageFunction is examining an index page (e.g. an arbitrary label like START/LISTING in User Data).

    async function pageFunction(context) {
    
    
    
        // *********************************************************** //
        //                      Few utilities                          //
        // *********************************************************** //
        const { request, log, skipLinks } = context;
        // request: holds info about current page
        // log: logs messages to console
        // skipLinks: don't enqueue matching Pseudo Links on current page
        // >> cf. https://docs.apify.com/tutorials/apify-scrapers/getting-started#new-page-function-boilerplate
        const $ = jQuery;
    
    
    
    
    
        // *********************************************************** //
        //                Infinite scroll handling                     //
        // *********************************************************** //
        // Here we define the infinite scroll function, it has to be defined inside pageFunction
        const infiniteScroll = async (maxTime) => { // maxTime: maximum total time to wait, in ms
    
            const startedAt = Date.now();
            // count items on page
            let itemCount = $('div.agencies div.column a').length; // Update the selector
    
            while (true) {
    
                log.info(`INFINITE SCROLL --- ${itemCount} items loaded --- ${request.url}`)
                // timeout to prevent infinite loop
                if (Date.now() - startedAt > maxTime) {
                    return;
                }
                // scroll page x, y
                window.scrollBy(0, 9999);
                // wait for elements to render
                await context.waitFor(5000); // This can be any number that works for your website
                // count items on page again
                const currentItemCount = $('div.agencies div.column a').length; // Update the selector
    
                // check for no more
                // We check if the number of items changed after the scroll, if not we finish
                if (itemCount === currentItemCount) {
                    return;
                }
                // update item count
                itemCount = currentItemCount;
    
            }
    
        }
    
    
    
    
    
    
        // *********************************************************** //
        //                          START page                         //
        // *********************************************************** //
        if (request.userData.label === 'START') {
            log.info('Store opened!');
            // Do some stuff later.
    
            // scroll to bottom to force load of all elements
            await infiniteScroll(60000); // Let's try 60 seconds max
    
        }
    
    
        // *********************************************************** //
        //                          DETAIL page                        //
        // *********************************************************** //
        if (request.userData.label === 'DETAIL') {
            log.info(`Scraping ${request.url}`);
            await skipLinks();
            
            // Do some scraping (get elements with jQuery selectors)
    
            return {
                // Scraped data.
            }
        }
    }
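To fill in the DETAIL branch's "Do some scraping" stub, the extraction can be factored into a small function. The selectors below ('h1', '.agency-description') are placeholders, not taken from the actual agency-profile markup - inspect the real page structure in DevTools before relying on them:

```javascript
// Hypothetical DETAIL-page extraction. The selectors are guesses; check
// the real markup first. Taking $ as a parameter keeps the logic testable.
const extractAgency = ($, url) => ({
    url,
    name: $('h1').first().text().trim(),
    description: $('.agency-description').first().text().trim(),
});

// Inside the DETAIL branch, the return would then become:
//   return extractAgency(jQuery, request.url);
```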