
Are web scrapers limited by CPU, RAM, or I/O?


When a web scraper is written in PHP (running on nginx under Ubuntu) and we want many scrapers crawling many different sites at the same time, what will be the limiting factor?

CPU, RAM, or disk I/O?


Solution

  • RAM and disk I/O will likely become the bottleneck long before CPU, depending on how many simultaneous processes you run. Each scraper will probably maintain an associative array of visited URLs and found resources. For large sites this array grows large, especially if you allow 4 KB for each URL and store it raw.

    You will probably want to hash each URL instead (for example a 40-character SHA-1 hex digest, or its smaller 20-byte binary form), which can save a lot of RAM.

    Avoid disk I/O as much as you can: write only when absolutely necessary to limit its impact, and consider writing to a database rather than to a file on disk that may sit on a network mount.
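The two suggestions above (hashed URL keys and infrequent writes) could look roughly like the sketch below. It is only an illustration under stated assumptions: `VisitedSet` and `BatchWriter` are hypothetical helper names, the batch size of 2 is chosen just to keep the demo short, and the sink is a plain callable standing in for whatever storage you actually use.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: tracks visited URLs by their raw SHA-1 digest
// (20 bytes) instead of keeping each full URL string as the array key.
final class VisitedSet
{
    private array $seen = [];

    // Returns true the first time a URL is added, false on repeats.
    public function add(string $url): bool
    {
        $key = sha1($url, true); // second argument true => 20-byte binary digest
        if (isset($this->seen[$key])) {
            return false;
        }
        $this->seen[$key] = true;
        return true;
    }

    public function size(): int
    {
        return count($this->seen);
    }
}

// Hypothetical helper: buffers rows in memory and passes them to a sink
// callable in batches, so storage is touched once per batch, not per row.
final class BatchWriter
{
    private array $rows = [];
    private int $batchSize;
    private $sink; // callable; PHP has no callable property type

    public function __construct(callable $sink, int $batchSize = 500)
    {
        $this->sink = $sink;
        $this->batchSize = $batchSize;
    }

    public function write(array $row): void
    {
        $this->rows[] = $row;
        if (count($this->rows) >= $this->batchSize) {
            $this->flush();
        }
    }

    // Sends any buffered rows to the sink and empties the buffer.
    public function flush(): void
    {
        if ($this->rows === []) {
            return;
        }
        ($this->sink)($this->rows);
        $this->rows = [];
    }
}

// Demo: the sink only records batch sizes; a real one might instead
// insert the rows into a database inside a single transaction.
$visited = new VisitedSet();
$batches = [];
$writer = new BatchWriter(function (array $rows) use (&$batches): void {
    $batches[] = count($rows);
}, 2);

$urls = [
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/a', // duplicate, skipped by the visited set
    'http://example.com/c',
];
foreach ($urls as $url) {
    if ($visited->add($url)) {
        $writer->write(['url' => $url]);
    }
}
$writer->flush(); // push out the final partial batch
```

Every key in the visited set is a fixed 20 bytes regardless of URL length, and the trade-off is that the original URL cannot be recovered from the set. How much RAM this saves in practice depends on PHP's per-element array overhead, so it is worth measuring with `memory_get_usage()` on realistic data before relying on it.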