Tags: javascript, python, web-crawler, puppeteer, apify

Memory issue when using Apify Puppeteer for crawling


I have been working on a Python project where the user provides a long list of URLs (let's say 100), and the program spawns 100 processes to execute the JavaScript code containing the crawler logic (using Apify.launchPuppeteer()). The JavaScript code is created and modified based on the Apify Puppeteer single-page template.

However, having 100 processes call the crawling code concurrently uses up a lot of memory, which leads to lagging. Since the Python code waits to read the result from a file written by the JavaScript code, insufficient memory severely degrades performance and raises errors during file writing. I was wondering whether there is any way to optimize the JavaScript crawler code, or whether improvements can be made on both sides.

Edit: extra information on the program: the user provides a list of URLs (domains), and the program should crawl all links within each domain recursively (e.g. crawl all hyperlinks of the domain github.com).


Solution

  • Launching 100 separate crawling processes is totally unnecessary. Apify provides crawler classes that can scrape a list or queue full of URLs, and they manage concurrency themselves so the run stays within CPU and memory limits. We commonly scrape millions of URLs without significant memory or CPU issues. I would use PuppeteerCrawler; see the sketch below.
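
A minimal sketch of what that could look like with the Apify SDK (the same package that provides Apify.launchPuppeteer()). The start URL, the pseudo-URL pattern, and the maxConcurrency value are placeholders to adapt to your project, and newer Crawlee releases rename some of these options (e.g. handlePageFunction becomes requestHandler):

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Put the user-supplied start URLs into a single request queue
    // instead of spawning one process per URL.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://github.com/' }); // placeholder start URL

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // The crawler's autoscaled pool keeps concurrency within memory/CPU limits;
        // cap it explicitly if you want a hard upper bound.
        maxConcurrency: 10,
        handlePageFunction: async ({ request, page }) => {
            // Extract whatever data you need from the page.
            const title = await page.title();
            await Apify.pushData({ url: request.url, title });

            // Recursively enqueue links found on the page, restricted to the same domain.
            await Apify.utils.enqueueLinks({
                page,
                requestQueue,
                pseudoUrls: ['https://github.com/[.*]'], // placeholder domain pattern
            });
        },
    });

    await crawler.run();
});
```

With this approach a single crawler process with a bounded maxConcurrency replaces the 100 Python-spawned processes, and the results can be pushed to a dataset (or written to one output file) that the Python side reads once the run finishes.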