I'm working on a project for which I'm evaluating both Scrapy and Apify. Most of our code is Node.js, so a JavaScript solution would be nice. I also like the fact that I can use Puppeteer in Apify. That said, my use case requires fairly shallow crawls (e.g. a depth of roughly 4) of many websites. This is easy to configure in Scrapy, but I can't figure out how to do it in Apify. Is there a way to specify max depth in the new Apify API? It looks like this was a parameter in their legacy crawler, but I haven't found it in the new API.
There are two approaches you can take. First, you can use the public Puppeteer Scraper actor, which exposes most of the Apify SDK's features in a simplified form; the maximum crawl depth is available there as a simple input under the Performance and limits section. To learn the basics, visit the introduction tutorial.
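For illustration, calling that actor from the Apify SDK could look roughly like the sketch below. Note that maxCrawlingDepth is my assumption for the input key behind the "Max crawling depth" UI field, so verify it against the actor's input schema before relying on it.

const Apify = require("apify");

Apify.main(async () => {
    // Run the public Puppeteer Scraper actor with a depth limit.
    // NOTE: maxCrawlingDepth is assumed to be the input key behind the
    // "Max crawling depth" UI field - check the actor's input schema.
    const run = await Apify.call("apify/puppeteer-scraper", {
        startUrls: [{ url: "https://stackoverflow.com" }],
        linkSelector: "a[href]",
        maxCrawlingDepth: 4,
        // The page function is passed as a string in the actor's input.
        pageFunction: `async function pageFunction({ request, page }) {
            return { url: request.url, title: await page.title() };
        }`,
    });
    console.log(`Actor run finished with status ${run.status}`);
});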
The second approach is more involved and uses the Apify SDK directly. You can pass arbitrary user data along with each request via the request.userData property. This way, before you add more pages to the crawl queue, you can check whether you've already reached the desired depth:
const Apify = require("apify");

const MAX_DEPTH = 4;

// When creating the request queue, we seed the first request with a depth of 0.
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({
    url: "https://stackoverflow.com",
    userData: {
        depth: 0,
    },
});
// ...
// Then, somewhere in handlePageFunction, when adding more requests to the queue.
if (request.userData.depth < MAX_DEPTH) {
    await requestQueue.addRequest({
        url: "https://example.com",
        userData: {
            depth: request.userData.depth + 1,
        },
    });
}
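To put both pieces together, here is a minimal end-to-end sketch using PuppeteerCrawler. It relies on the SDK's Apify.utils.enqueueLinks() helper and its transformRequestFunction option to stamp every enqueued request with the incremented depth; the start URL and link selector are placeholders you'd adjust for your sites.

const Apify = require("apify");

const MAX_DEPTH = 4;

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({
        url: "https://stackoverflow.com",
        userData: { depth: 0 },
    });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            console.log(`${request.url} (depth ${request.userData.depth})`);

            // Only follow links from pages that are still below the depth limit.
            if (request.userData.depth < MAX_DEPTH) {
                await Apify.utils.enqueueLinks({
                    page,
                    requestQueue,
                    selector: "a[href]",
                    // Stamp each discovered request with the parent's depth + 1.
                    transformRequestFunction: (req) => {
                        req.userData = { depth: request.userData.depth + 1 };
                        return req;
                    },
                });
            }
        },
    });

    await crawler.run();
});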