How do I use the features of Apify to generate a full list of URLs for scraping from an index page in which items are added in sequential batches when the user scrolls toward the bottom? In other words, it's dynamic loading/infinite scroll, not operating on a button click.
Specifically, for this page - https://www.provokemedia.com/agency-playbook - I cannot make it identify anything beyond the initially-displayed 13 entries.
This element appears at the bottom of each segment, with display: none changing to display: block at every segment addition. No "style" attribute is visible in the raw source; it only shows up via the DevTools Inspector.
<div class="text-center" id="loader" style="display: none;">
<h5>Loading more ...</h5>
</div>
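One way to confirm this toggling - just a throwaway check run from the DevTools Console, not part of the scraper - is to watch the element's style attribute with a MutationObserver:
// Illustrative only: log the loader's computed display each time its inline style changes.
const loader = document.querySelector('#loader');
new MutationObserver(() => {
    console.log('loader display is now:', getComputedStyle(loader).display);
}).observe(loader, { attributes: true, attributeFilter: ['style'] });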
Here is my basic setup for web-scraper...
Start URLs:
https://www.provokemedia.com/agency-playbook
{
"label": "START"
}
Link selector:
div.agencies div.column a
Pseudo URLs:
https://www.provokemedia.com/agency-playbook/agency-profile/[.*]
{
"label": "DETAIL"
}
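(As I understand Apify's Pseudo URL syntax, the part in [brackets] is treated as a regular expression, so this should match detail pages like the gci-health one in the log further below - roughly equivalent to testing:)
// Rough plain-regex equivalent of the Pseudo URL above (illustration only).
const purl = /^https:\/\/www\.provokemedia\.com\/agency-playbook\/agency-profile\/.*$/;
console.log(purl.test('https://www.provokemedia.com/agency-playbook/agency-profile/gci-health')); // true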
Page function:
async function pageFunction(context) {
const { request, log, skipLinks } = context;
// request: holds info about current page
// log: logs messages to console
// skipLinks: don't enqueue matching Pseudo Links on current page
// >> cf. https://docs.apify.com/tutorials/apify-scrapers/getting-started#new-page-function-boilerplate
// *********************************************************** //
// START page //
// *********************************************************** //
if (request.userData.label === 'START') {
log.info('Store opened!');
// Do some stuff later.
}
// *********************************************************** //
// DETAIL page //
// *********************************************************** //
if (request.userData.label === 'DETAIL') {
log.info(`Scraping ${request.url}`);
await skipLinks();
// Do some scraping.
return {
// Scraped data.
}
}
}
Presumably, inside the START branch, I need to reveal the whole list so that more than just those 13 entries get enqueued.
I have read through Apify's docs, including the section on "Waiting for dynamic content". await waitFor('#loader');
seemed like a good bet.
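If I'm reading that tutorial right, waitFor accepts a selector string, a number of milliseconds, or a predicate function, all awaited inside the async pageFunction - something like:
// Forms of waitFor shown in the tutorial (window.loaded is just a made-up example flag):
await waitFor('#loader');                    // wait for a selector to appear
await waitFor(2000);                         // wait a fixed 2000 ms
await waitFor(() => window.loaded === true); // wait until a predicate returns true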
I added the following to the START portion...
let timeoutMillis; // undefined
const loadingThing = '#loader';
while (true) {
log.info('Waiting for the "Loading more" thing.');
try {
// Default timeout first time.
await waitFor(loadingThing, { timeoutMillis });
// 2 sec timeout after the first.
timeoutMillis = 2000;
} catch (err) {
// Ignore the timeout error.
log.info('Could not find the "Loading more thing", '
+ 'we\'ve reached the end.');
break;
}
log.info('Going to load more.');
// Scroll to bottom, to expose more
// $(loadingThing).click();
window.scrollTo(0, document.body.scrollHeight);
}
But it didn't work...
2021-01-08T23:24:11.186Z INFO Store opened!
2021-01-08T23:24:11.189Z INFO Waiting for the "Loading more" thing.
2021-01-08T23:24:11.190Z INFO Could not find the "Loading more thing", we've reached the end.
2021-01-08T23:24:13.393Z INFO Scraping https://www.provokemedia.com/agency-playbook/agency-profile/gci-health
Unlike other web pages, this page does not scroll to the bottom when I manually enter window.scrollTo(0, document.body.scrollHeight); into the DevTools Console.
However, when executed manually in the Console, code that adds a small delay - setTimeout(function(){window.scrollBy(0,document.body.scrollHeight)}, 1); - as found in this question - does jump to the bottom each time.
If I replace the last line of the while loop above with that line, however, the loop still logs that it could not find the element.
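My guess is that a bare setTimeout only schedules the scroll and returns immediately, so the async pageFunction's loop races ahead before the scroll ever fires; presumably it would need to be wrapped in an awaited Promise - an untested sketch:
// Untested sketch: make the delayed scroll awaitable so the while loop
// doesn't continue before the scroll has actually happened.
await new Promise((resolve) => {
    setTimeout(() => {
        window.scrollBy(0, document.body.scrollHeight);
        resolve();
    }, 1);
});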
Am I misusing these methods? I'm not sure which way to turn.
@LukášKřivka's answer at How to make the Apify Crawler to scroll full page when web page have infinite scrolling? provides the framework for my answer...
Summary:
- In a while loop, scroll to the bottom of the page.
- Call this function only when pageFunction is examining an index page (eg. arbitrary page name like START/LISTING in User Data).
Detail:
async function pageFunction(context) {
// *********************************************************** //
// Few utilities //
// *********************************************************** //
const { request, log, skipLinks } = context;
// request: holds info about current page
// log: logs messages to console
// skipLinks: don't enqueue matching Pseudo Links on current page
// >> cf. https://docs.apify.com/tutorials/apify-scrapers/getting-started#new-page-function-boilerplate
const $ = jQuery; // available on the page when the scraper's "Inject jQuery" input option is enabled
// *********************************************************** //
// Infinite scroll handling //
// *********************************************************** //
// Here we define the infinite scroll function, it has to be defined inside pageFunction
const infiniteScroll = async (maxTime) => { //maxTime to wait
const startedAt = Date.now();
// count items on page
let itemCount = $('div.agencies div.column a').length; // Update the selector
while (true) {
log.info(`INFINITE SCROLL --- ${itemCount} items loaded --- ${request.url}`)
// timeout to prevent infinite loop
if (Date.now() - startedAt > maxTime) {
return;
}
// scroll page x, y
scrollBy(0, 9999);
// wait for elements to render
await context.waitFor(5000); // This can be any number that works for your website
// count items on page again
const currentItemCount = $('div.agencies div.column a').length; // Update the selector
// check for no more
// We check if the number of items changed after the scroll, if not we finish
if (itemCount === currentItemCount) {
return;
}
// update item count
itemCount = currentItemCount;
}
}
// *********************************************************** //
// START page //
// *********************************************************** //
if (request.userData.label === 'START') {
log.info('Store opened!');
// Do some stuff later.
// scroll to bottom to force load of all elements
await infiniteScroll(60000); // Let's try 60 seconds max
}
// *********************************************************** //
// DETAIL page //
// *********************************************************** //
if (request.userData.label === 'DETAIL') {
log.info(`Scraping ${request.url}`);
await skipLinks();
// Do some scraping (get elements with jQuery selectors)
return {
// Scraped data.
}
}
}
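Note that this approach compares the item count before and after each scroll instead of watching the #loader element's visibility, which sidesteps the display: none issue that defeated my earlier attempt, and the maxTime guard ensures the while loop ends even if new items keep loading indefinitely.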