Tags: performance, scrapy, web-crawler, distributed-system

How to build a powerful crawler like Google's?


I want to build a crawler that can update hundreds of thousands of links in several minutes. Are there any mature approaches to the scheduling? Is a distributed system needed? What is the greatest barrier that limits performance? Thanks.


Solution

  • For Python you could go with Frontera by Scrapinghub

    https://github.com/scrapinghub/frontera

    https://github.com/scrapinghub/frontera/blob/distributed/docs/source/topics/distributed-architecture.rst

    They're the same people who make Scrapy; a minimal single-machine Scrapy sketch follows after this list.

    There's also Apache Nutch, which is a much older project: http://nutch.apache.org/
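
If a single machine is enough, plain Scrapy with raised concurrency settings already goes a long way; a distributed frontier like Frontera mainly matters once one machine's bandwidth or CPU becomes the bottleneck. Below is a minimal sketch of a Scrapy spider that re-fetches a known list of URLs with high concurrency. The input file `urls.txt`, the spider name, and the exact settings values are illustrative assumptions, not part of the original answer.

```python
import scrapy


class RefreshSpider(scrapy.Spider):
    """Sketch: re-fetch a fixed list of URLs as fast as politeness allows."""
    name = "refresh"  # hypothetical name for this example
    custom_settings = {
        # Crawling is dominated by network I/O, so raise concurrency well
        # above Scrapy's default of 16; tune to your bandwidth and the
        # per-domain rate the target sites can tolerate.
        "CONCURRENT_REQUESTS": 256,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_TIMEOUT": 15,
    }

    def start_requests(self):
        # Assumed input: one URL per line in urls.txt.
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Record whatever "updated" means for your use case; here we
        # just emit the status code and the payload size.
        yield {
            "url": response.url,
            "status": response.status,
            "bytes": len(response.body),
        }
```

Run it with `scrapy runspider refresh_spider.py -o results.jl`. Whether hundreds of thousands of URLs fit into a few minutes comes down mostly to bandwidth and how far you can safely push per-domain concurrency; once that outgrows one machine, partitioning the URL frontier across workers is exactly the scheduling problem Frontera's distributed architecture addresses.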