Tags: performance, scrapy, web-crawler, distributed-system

How to build a powerful crawler like Google's?


I want to build a crawler that can update hundreds of thousands of links in several minutes. Are there any mature approaches to the scheduling? Is a distributed system needed? What is the greatest barrier that limits performance? Thanks.


Solution

  • For Python you could go with Frontera by Scrapinghub

    https://github.com/scrapinghub/frontera

    https://github.com/scrapinghub/frontera/blob/distributed/docs/source/topics/distributed-architecture.rst

    They're the same people who make Scrapy; a minimal single-machine Scrapy sketch follows after this list.

    There's also Apache Nutch, which is a much older project: http://nutch.apache.org/
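
If a single machine is enough, plain Scrapy with raised concurrency settings already goes a long way; a distributed frontier like Frontera mainly matters once one machine's bandwidth or CPU becomes the bottleneck. Below is a minimal sketch of a Scrapy spider that re-fetches a known list of URLs with high concurrency. The input file `urls.txt`, the spider name, and the exact settings values are illustrative assumptions, not part of the original answer.

```python
import scrapy


class RefreshSpider(scrapy.Spider):
    """Sketch: re-fetch a fixed list of URLs as fast as politeness allows."""
    name = "refresh"  # hypothetical name for this example
    custom_settings = {
        # Crawling is dominated by network I/O, so raise concurrency well
        # above Scrapy's default of 16; tune to your bandwidth and the
        # per-domain rate the target sites can tolerate.
        "CONCURRENT_REQUESTS": 256,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "DOWNLOAD_TIMEOUT": 15,
    }

    def start_requests(self):
        # Assumed input: one URL per line in urls.txt.
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Record whatever "updated" means for your use case; here we
        # just emit the status code and the payload size.
        yield {
            "url": response.url,
            "status": response.status,
            "bytes": len(response.body),
        }
```

Run it with `scrapy runspider refresh_spider.py -o results.jl`. Whether hundreds of thousands of URLs fit into a few minutes comes down mostly to bandwidth and how far you can safely push per-domain concurrency; once that outgrows one machine, partitioning the URL frontier across workers is exactly the scheduling problem Frontera's distributed architecture addresses.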